knitr::opts_chunk$set(
    message = FALSE,
    warning = FALSE
)

1. Introduction

The dataset, available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29), consists of features computed from digitized images of fine needle aspirates (FNA) of breast masses. The features describe characteristics of the cell nuclei present in the images.

Ten real-valued features are computed for each cell nucleus. The following nuclear features were analyzed:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry 
j) fractal dimension ("coastline approximation" - 1)

The dataset has the following attribute information:

(a) Number of instances: 569

(b) Number of attributes: 32

  1. ID

  2. diagnosis: The diagnosis of breast tissues (M = malignant, B = benign) - Class distribution: 357 benign, 212 malignant

  3. radius_mean: mean of distances from center to points on the perimeter
  4. texture_mean: standard deviation of gray-scale values
  5. perimeter_mean: mean size of the core tumor
  6. area_mean
  7. smoothness_mean: mean of local variation in radius lengths
  8. compactness_mean: mean of perimeter^2 / area - 1.0
  9. concavity_mean: mean of severity of concave portions of the contour
  10. concave points_mean: mean for number of concave portions of the contour
  11. symmetry_mean
  12. fractal_dimension_mean: mean for “coastline approximation” - 1

  13. radius_se: standard error for the mean of distances from center to points on the perimeter
  14. texture_se: standard error for standard deviation of gray-scale values
  15. perimeter_se
  16. area_se
  17. smoothness_se: standard error for local variation in radius lengths
  18. compactness_se: standard error for perimeter^2 / area - 1.0
  19. concavity_se: standard error for severity of concave portions of the contour
  20. concave points_se: standard error for number of concave portions of the contour
  21. symmetry_se
  22. fractal_dimension_se: standard error for “coastline approximation” - 1

  23. radius_worst: “worst” or largest mean value for mean of distances from center to points on the perimeter
  24. texture_worst: “worst” or largest mean value for standard deviation of gray-scale values
  25. perimeter_worst
  26. area_worst
  27. smoothness_worst: “worst” or largest mean value for local variation in radius lengths
  28. compactness_worst: “worst” or largest mean value for perimeter^2 / area - 1.0
  29. concavity_worst: “worst” or largest mean value for severity of concave portions of the contour
  30. concave points_worst: “worst” or largest mean value for number of concave portions of the contour
  31. symmetry_worst
  32. fractal_dimension_worst: “worst” or largest mean value for “coastline approximation”

Breast cancer is the most common invasive cancer in women and affects about 12% of women worldwide (McGuire, A; Brown, JA; Malone, C; McLaughlin, R; Kerin, MJ (22 May 2015). "Effects of age on the detection and management of breast cancer". Cancers. 7 (2): 908–29. doi:10.3390/cancers7020815).

The fine needle aspiration (FNA) procedure helps establish the breast cancer diagnosis. Together with physical examination of the breasts and mammography, FNA can be used to diagnose breast cancer with a good degree of accuracy.

A well-described characterization of the cell nuclei from the digitized images, including the establishment of patterns/models, can help improve breast cancer diagnosis.

Figures 1, 2 and 3. Digital images from a breast FNA. M. W. Teague, W. H. Wolberg, W. N. Street, O. L. Mangasarian, S. Labremont, and D. L. Page. Indeterminate fine needle aspiration of the breast: Image analysis aided diagnosis. Cancer Cytopathology 81: 129-135, 1997. W. N. Street. Xcyt: A System for Remote Cytological Diagnosis and Prognosis of Breast Cancer. Management Sciences Department. University of Iowa, Iowa City, IA.


Objective: Analyse cell nuclei characteristics and, if possible, identify patterns related to the diagnosis of breast tissue (malignant or benign). Additionally, machine learning models for diagnosis will be proposed.

2. Loading Packages and Dataset

library(dplyr)
library(ggplot2)
library(tidyverse)
library(formattable)
library(reshape2)
library(pander)
library(ggpubr)
library(ggpmisc)
library(ltm)
library(randomForest)
library(GGally)
library(RColorBrewer)
library(car)
library(corrplot)
library(factoextra)
library(FactoMineR)
library(caret)
library(rpart)
library(rpart.plot)
library(gridExtra)
library(DT)
setwd("C:/Users/bdeta/Documents/R/Projects/2 - Breast Cancer") # Adjust to your local data directory
df <- as.data.frame(read_csv("data.csv"))

3. Exploratory Data Analysis (EDA)

df <- subset(df, select = -X33) # Remove the column X33 (NAs)
df$diagnosis <- as.factor(df$diagnosis) # Transform chr to factor
names(df) <- gsub(" ", "_", names(df)) # Fix spaces in column names
summary(df)
##        id            diagnosis  radius_mean      texture_mean  
##  Min.   :     8670   B:357     Min.   : 6.981   Min.   : 9.71  
##  1st Qu.:   869218   M:212     1st Qu.:11.700   1st Qu.:16.17  
##  Median :   906024             Median :13.370   Median :18.84  
##  Mean   : 30371831             Mean   :14.127   Mean   :19.29  
##  3rd Qu.:  8813129             3rd Qu.:15.780   3rd Qu.:21.80  
##  Max.   :911320502             Max.   :28.110   Max.   :39.28  
##  perimeter_mean     area_mean      smoothness_mean   compactness_mean 
##  Min.   : 43.79   Min.   : 143.5   Min.   :0.05263   Min.   :0.01938  
##  1st Qu.: 75.17   1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492  
##  Median : 86.24   Median : 551.1   Median :0.09587   Median :0.09263  
##  Mean   : 91.97   Mean   : 654.9   Mean   :0.09636   Mean   :0.10434  
##  3rd Qu.:104.10   3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040  
##  Max.   :188.50   Max.   :2501.0   Max.   :0.16340   Max.   :0.34540  
##  concavity_mean    concave_points_mean symmetry_mean   
##  Min.   :0.00000   Min.   :0.00000     Min.   :0.1060  
##  1st Qu.:0.02956   1st Qu.:0.02031     1st Qu.:0.1619  
##  Median :0.06154   Median :0.03350     Median :0.1792  
##  Mean   :0.08880   Mean   :0.04892     Mean   :0.1812  
##  3rd Qu.:0.13070   3rd Qu.:0.07400     3rd Qu.:0.1957  
##  Max.   :0.42680   Max.   :0.20120     Max.   :0.3040  
##  fractal_dimension_mean   radius_se        texture_se      perimeter_se   
##  Min.   :0.04996        Min.   :0.1115   Min.   :0.3602   Min.   : 0.757  
##  1st Qu.:0.05770        1st Qu.:0.2324   1st Qu.:0.8339   1st Qu.: 1.606  
##  Median :0.06154        Median :0.3242   Median :1.1080   Median : 2.287  
##  Mean   :0.06280        Mean   :0.4052   Mean   :1.2169   Mean   : 2.866  
##  3rd Qu.:0.06612        3rd Qu.:0.4789   3rd Qu.:1.4740   3rd Qu.: 3.357  
##  Max.   :0.09744        Max.   :2.8730   Max.   :4.8850   Max.   :21.980  
##     area_se        smoothness_se      compactness_se      concavity_se    
##  Min.   :  6.802   Min.   :0.001713   Min.   :0.002252   Min.   :0.00000  
##  1st Qu.: 17.850   1st Qu.:0.005169   1st Qu.:0.013080   1st Qu.:0.01509  
##  Median : 24.530   Median :0.006380   Median :0.020450   Median :0.02589  
##  Mean   : 40.337   Mean   :0.007041   Mean   :0.025478   Mean   :0.03189  
##  3rd Qu.: 45.190   3rd Qu.:0.008146   3rd Qu.:0.032450   3rd Qu.:0.04205  
##  Max.   :542.200   Max.   :0.031130   Max.   :0.135400   Max.   :0.39600  
##  concave_points_se   symmetry_se       fractal_dimension_se
##  Min.   :0.000000   Min.   :0.007882   Min.   :0.0008948   
##  1st Qu.:0.007638   1st Qu.:0.015160   1st Qu.:0.0022480   
##  Median :0.010930   Median :0.018730   Median :0.0031870   
##  Mean   :0.011796   Mean   :0.020542   Mean   :0.0037949   
##  3rd Qu.:0.014710   3rd Qu.:0.023480   3rd Qu.:0.0045580   
##  Max.   :0.052790   Max.   :0.078950   Max.   :0.0298400   
##   radius_worst   texture_worst   perimeter_worst    area_worst    
##  Min.   : 7.93   Min.   :12.02   Min.   : 50.41   Min.   : 185.2  
##  1st Qu.:13.01   1st Qu.:21.08   1st Qu.: 84.11   1st Qu.: 515.3  
##  Median :14.97   Median :25.41   Median : 97.66   Median : 686.5  
##  Mean   :16.27   Mean   :25.68   Mean   :107.26   Mean   : 880.6  
##  3rd Qu.:18.79   3rd Qu.:29.72   3rd Qu.:125.40   3rd Qu.:1084.0  
##  Max.   :36.04   Max.   :49.54   Max.   :251.20   Max.   :4254.0  
##  smoothness_worst  compactness_worst concavity_worst  concave_points_worst
##  Min.   :0.07117   Min.   :0.02729   Min.   :0.0000   Min.   :0.00000     
##  1st Qu.:0.11660   1st Qu.:0.14720   1st Qu.:0.1145   1st Qu.:0.06493     
##  Median :0.13130   Median :0.21190   Median :0.2267   Median :0.09993     
##  Mean   :0.13237   Mean   :0.25427   Mean   :0.2722   Mean   :0.11461     
##  3rd Qu.:0.14600   3rd Qu.:0.33910   3rd Qu.:0.3829   3rd Qu.:0.16140     
##  Max.   :0.22260   Max.   :1.05800   Max.   :1.2520   Max.   :0.29100     
##  symmetry_worst   fractal_dimension_worst
##  Min.   :0.1565   Min.   :0.05504        
##  1st Qu.:0.2504   1st Qu.:0.07146        
##  Median :0.2822   Median :0.08004        
##  Mean   :0.2901   Mean   :0.08395        
##  3rd Qu.:0.3179   3rd Qu.:0.09208        
##  Max.   :0.6638   Max.   :0.20750

3.1. Nuclear features analysis:

1. radius

radius <- df %>%
 dplyr::select(c(diagnosis, radius_mean, radius_se, radius_worst)) %>%
 group_by(diagnosis) %>%
 summarise(Mean_radius_mean = mean(radius_mean), Mean_radius_se = mean(radius_se), Mean_radius_worst = mean(radius_worst))

formattable(radius, list(
 diagnosis = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
 Mean_radius_mean = color_tile("#f7d383", "#fec306"),
 Mean_radius_se = color_tile("#eb724d", "#df5227"),
 Mean_radius_worst = color_tile("#b8ddf2", "#56B4E9")))
diagnosis   Mean_radius_mean   Mean_radius_se   Mean_radius_worst
B                   12.14652        0.2840824            13.37980
M                   17.46283        0.6090825            21.13481

The means of the radius variables (mean, se, worst) are higher in the malignant breast cancer group than in the benign breast cancer group.

test.m <- melt(df,id.vars='diagnosis', measure.vars=c('radius_mean','radius_se','radius_worst'))

ggplot(test.m, aes(x=diagnosis, y=value, fill=variable)) +
  geom_boxplot(alpha = 2/3) +
  labs(x = 'diagnosis') +
  scale_fill_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
  scale_color_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
  theme_bw() + ggtitle("diagnosis x radius variables") +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(~variable) +
  geom_jitter(alpha = I(1/4), aes(color = variable)) +
  stat_summary(fun = mean, geom = "text", size = 3, vjust = -3, aes(label = round(after_stat(y), digits = 2)))

Higher variability/spread for radius variables (mean, se, worst) was observed in the malignant breast cancer group.

ggplot(test.m, aes(x=value)) +
  geom_histogram(binwidth=2, aes(y=after_stat(density)), position="identity", alpha=0.7, color="black") +
  geom_density(alpha=0.4, color = NA) +
  labs(x = "", y = "Density", title = 'Distribution of the radius variables') + theme_bw() +
  aes(fill = variable) +
  scale_fill_manual(values = c("#fec306", "#df5227", "#56B4E9")) +  
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(~variable) +
  ylim(0, 0.5)

shapiro.tests <- t(as.data.frame(lapply(df[,c("radius_mean", "radius_se", "radius_worst")], function(x) shapiro.test(x)$p.value)))
colnames(shapiro.tests) <- "p-value"
as.data.frame(shapiro.tests)
##                   p-value
## radius_mean  3.105644e-14
## radius_se    1.224597e-28
## radius_worst 1.704294e-17

Normal distribution verification: The Shapiro-Wilk test and the shape of the histograms confirmed that the radius variables (mean, se, worst) are not normally distributed, so we applied a non-parametric test: the unpaired two-samples Wilcoxon test (also known as the Mann-Whitney test).

wilcox.tests <- t(as.data.frame(lapply(df[,c("radius_mean", "radius_se", "radius_worst")], function(x) wilcox.test(x ~ df$diagnosis, conf.level = 0.99)$p.value)))
colnames(wilcox.tests) <- "p-value"
as.data.frame(wilcox.tests)
##                   p-value
## radius_mean  2.692943e-68
## radius_se    6.217140e-49
## radius_worst 1.135630e-78

Wilcoxon test results: The p-values are < 0.01. Hence, we reject the null hypothesis: there are significant differences in all radius variables (mean, se, worst) between the groups.

The malignant breast cancer group has higher radius values (mean of distances from center to points on the perimeter) than the benign group.
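As a complement to the p-values, an effect size can be reported alongside the Wilcoxon test. The helper below is an illustrative sketch (not part of the original analysis) of the rank-biserial correlation computed from the Mann-Whitney W statistic; note that sign conventions vary between references.

```r
# Rank-biserial effect size from the Mann-Whitney W statistic:
# r = 1 - 2W / (n1 * n2). Illustrative helper, not from the original analysis.
rank_biserial <- function(x, y) {
  w <- as.numeric(wilcox.test(x, y, exact = FALSE)$statistic)
  1 - 2 * w / (length(x) * length(y))
}

# Toy example: complete separation between the two samples
rank_biserial(c(1, 2, 3), c(5, 6, 7))
```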

cor.test(df$radius_mean, df$radius_worst)
## 
##  Pearson's product-moment correlation
## 
## data:  df$radius_mean and df$radius_worst
## t = 94.255, df = 567, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9641806 0.9741064
## sample estimates:
##      cor 
## 0.969539
ggplot(df, aes(radius_mean, radius_worst)) +
  geom_point(aes(color = diagnosis), size = 1, alpha = 0.4) +
  scale_color_manual(values = c("#f69400", "#838383")) +
  scale_fill_manual(values = c("#f69400", "#838383")) +
  facet_wrap(~diagnosis) +
  stat_smooth( aes(color = diagnosis, fill = diagnosis), method = "lm") +
  stat_cor(aes(color = diagnosis), label.y = 4.4) +
  stat_poly_eq(
    aes(color = diagnosis, label = ..eq.label..),
    formula = y ~ x, label.y = 4.2, parse = TRUE) +
  theme_bw() +
  ggtitle("Correlation of radius variables") +
  theme(plot.title = element_text(hjust = 0.5))

Correlation analysis: The analysis showed a positive, very strong (r = 0.969539), statistically significant (p-value < 2.2e-16) correlation between the radius_mean and radius_worst variables.
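Since the radius variables are not normally distributed, a rank-based Spearman correlation could complement the Pearson estimate. A minimal sketch on toy data (not the dataset itself):

```r
# Spearman's rho is Pearson's r computed on ranks, so it is invariant to
# monotone transformations; here b is a nonlinear but monotone function of a.
a <- 1:20
b <- a^2
cor(a, b, method = "spearman")                      # perfect monotone association
cor.test(a, b, method = "spearman")$p.value < 0.01  # significant at the 1% level
```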

A point-biserial correlation, used to measure the strength and direction of the association between continuous and binary variables, was carried out in order to verify the correlation between the radius feature and the diagnosis (benign or malignant).

b1 <- biserial.cor(df$radius_mean, df$diagnosis, level = 2) # Level 2 = the malignant breast cancer group
cat("Correlation value (r): ", b1, "strong")
## Correlation value (r):  0.7300285 strong
b2 <- biserial.cor(df$radius_se, df$diagnosis, level = 2)
cat("Correlation value (r): ", b2, "moderate")
## Correlation value (r):  0.5671338 moderate
b3 <- biserial.cor(df$radius_worst, df$diagnosis, level = 2)
cat("Correlation value (r): ", b3, "strong")
## Correlation value (r):  0.7764538 strong

Identifying extreme values: A commonly used rule (Tukey's rule) treats as outliers (extreme values, in this case) the observations more than 1.5 times the interquartile range (IQR) beyond the quartiles, i.e., below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. We quantified the outliers to better understand and characterize the data distribution and to improve the interpretation of the results, since extreme values could bias the statistical inferences and the predictive models.
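Tukey's rule can be sketched directly with quantiles, as a rough equivalent of what `boxplot()$out` returns (note that `boxplot()` uses hinges, which can differ slightly from `quantile()` for small samples):

```r
# Tukey's rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
# Illustrative helper, approximating boxplot(x)$out.
tukey_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence]
}

tukey_outliers(c(1:10, 100))   # only 100 lies outside the fences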

out_1 <- which(df$radius_mean %in% boxplot(df$radius_mean, plot=FALSE)$out)
n.out_1 <- length(out_1)
cat("Number of Extreme Values:", n.out_1)
## Number of Extreme Values: 14
df[as.numeric(out_1),c("id", "diagnosis", "radius_mean")]
##            id diagnosis radius_mean
## 83    8611555         M       25.22
## 109     86355         M       22.27
## 123    865423         M       24.25
## 165   8712289         M       23.27
## 181    873592         M       27.22
## 203    878796         M       23.29
## 213   8810703         M       28.11
## 237  88299702         M       23.21
## 340     89812         M       23.51
## 353    899987         M       25.73
## 370   9012000         M       22.01
## 462 911296202         M       27.42
## 504    915143         M       23.09
## 522  91762702         M       24.63
out_2 <- which(df$radius_se %in% boxplot(df$radius_se, plot=FALSE)$out)
n.out_2 <- length(out_2)
cat("Number of Extreme Values:", n.out_2)
## Number of Extreme Values: 38
df[as.numeric(out_2),c("id", "diagnosis", "radius_se")]
##            id diagnosis radius_se
## 1      842302         M    1.0950
## 13     846226         M    0.9555
## 26     852631         M    1.0460
## 28     852781         M    0.8529
## 39     855133         M    1.2140
## 43     855625         M    0.9811
## 78    8610637         M    0.9806
## 79    8610862         M    0.9317
## 83    8611555         M    0.8973
## 109     86355         M    1.2150
## 123    865423         M    1.5090
## 139    868826         M    1.2960
## 162   8711803         M    1.0000
## 169   8712766         M    1.0880
## 211 881046502         M    0.8601
## 213   8810703         M    2.8730
## 219   8811842         M    0.9553
## 237  88299702         M    1.0580
## 251    884948         M    1.0040
## 259    887181         M    1.2920
## 266  88995002         M    1.1720
## 273   8910988         M    1.1670
## 291  89143602         B    0.8811
## 301    892438         M    1.1110
## 303  89263202         M    1.0720
## 340     89812         M    1.0090
## 353    899987         M    0.9948
## 367   9011494         M    0.9761
## 369   9011971         M    1.2070
## 370   9012000         M    1.0080
## 418  90602302         M    1.3700
## 461 911296201         M    0.9291
## 462 911296202         M    2.5470
## 469   9113538         M    0.9289
## 504    915143         M    1.2910
## 522  91762702         M    0.9915
## 564    926125         M    0.9622
## 565    926424         M    1.1760
out_3 <- which(df$radius_worst %in% boxplot(df$radius_worst, plot=FALSE)$out)
n.out_3 <- length(out_3)
cat("Number of Extreme Values:", n.out_3)
## Number of Extreme Values: 17
df[as.numeric(out_3),c("id", "diagnosis", "radius_worst")]
##            id diagnosis radius_worst
## 24     851509         M        29.17
## 83    8611555         M        30.00
## 109     86355         M        28.40
## 165   8712289         M        28.01
## 181    873592         M        33.12
## 213   8810703         M        28.11
## 220  88119002         M        27.90
## 237  88299702         M        31.01
## 266  88995002         M        32.49
## 273   8910988         M        28.19
## 340     89812         M        30.67
## 353    899987         M        33.13
## 369   9011971         M        30.75
## 370   9012000         M        27.66
## 462 911296202         M        36.04
## 504    915143         M        30.79
## 522  91762702         M        29.92

2. texture

texture <- df %>%
 dplyr::select(c(diagnosis, texture_mean, texture_se, texture_worst)) %>%
 group_by(diagnosis) %>%
 summarise(Mean_texture_mean = mean(texture_mean), Mean_texture_se = mean(texture_se), Mean_texture_worst = mean(texture_worst))

formattable(texture, list(
 diagnosis = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
 Mean_texture_mean = color_tile("#f7d383", "#fec306"),
 Mean_texture_se = color_tile("#eb724d", "#df5227"),
 Mean_texture_worst = color_tile("#b8ddf2", "#56B4E9")))
diagnosis   Mean_texture_mean   Mean_texture_se   Mean_texture_worst
B                    17.91476          1.220380             23.51507
M                    21.60491          1.210915             29.31821

The means of the texture variables (mean, worst) are higher in the malignant breast cancer group than in the benign breast cancer group.

test.m <- melt(df,id.vars='diagnosis', measure.vars=c('texture_mean','texture_se','texture_worst'))

ggplot(test.m, aes(x=diagnosis, y=value, fill=variable)) +
  geom_boxplot(alpha = 2/3) +
  labs(x = 'diagnosis') +
  scale_fill_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
  scale_color_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
  theme_bw() + ggtitle("diagnosis x texture variables") +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(~variable) +
  geom_jitter(alpha = I(1/4), aes(color = variable)) +
  stat_summary(fun = mean, geom = "text", size = 3, vjust = -3, aes(label = round(after_stat(y), digits = 2)))

The variability/spread for texture variables (mean, se, worst) seems to be similar between the groups.

ggplot(test.m, aes(x=value)) +
  geom_histogram(binwidth=2, aes(y=after_stat(density)), position="identity", alpha=0.7, color="black") +
  geom_density(alpha=0.4, color = NA) +
  labs(x = "", y = "Density", title = 'Distribution of the texture variables') + theme_bw() +
  aes(fill = variable) +
  scale_fill_manual(values = c("#fec306", "#df5227", "#56B4E9")) +  
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(~variable) +
  ylim(0, 0.4)

shapiro.tests <- t(as.data.frame(lapply(df[,c("texture_mean", "texture_se", "texture_worst")], function(x) shapiro.test(x)$p.value)))
colnames(shapiro.tests) <- "p-value"
as.data.frame(shapiro.tests)
##                    p-value
## texture_mean  7.283581e-08
## texture_se    3.560601e-19
## texture_worst 2.564467e-06

Normal distribution verification: The Shapiro-Wilk test and the shape of the histograms confirmed that the texture variables (mean, se, worst) are not normally distributed, so we applied a non-parametric test: the unpaired two-samples Wilcoxon test (also known as the Mann-Whitney test).

wilcox.tests <- t(as.data.frame(lapply(df[,c("texture_mean", "texture_se", "texture_worst")], function(x) wilcox.test(x ~ df$diagnosis, conf.level = 0.99)$p.value)))
colnames(wilcox.tests) <- "p-value"
as.data.frame(wilcox.tests)
##                    p-value
## texture_mean  3.428627e-28
## texture_se    6.436927e-01
## texture_worst 6.517718e-30

Wilcoxon test results: The p-values are < 0.01 for 2 of the 3 texture variables. Hence, we reject the null hypothesis for these: there are significant differences in texture_mean and texture_worst between the groups, while texture_se shows no significant difference (p ≈ 0.64).

The malignant breast cancer group has higher texture values (standard deviation of gray-scale values) than the benign group.

cor.test(df$texture_mean, df$texture_worst)
## 
##  Pearson's product-moment correlation
## 
## data:  df$texture_mean and df$texture_worst
## t = 52.957, df = 567, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8971007 0.9249041
## sample estimates:
##       cor 
## 0.9120446
ggplot(df, aes(texture_mean, texture_worst)) +
  geom_point(aes(color = diagnosis), size = 1, alpha = 0.4) +
  scale_color_manual(values = c("#f69400", "#838383")) +
  scale_fill_manual(values = c("#f69400", "#838383")) +
  facet_wrap(~diagnosis) +
  stat_smooth( aes(color = diagnosis, fill = diagnosis), method = "lm") +
  stat_cor(aes(color = diagnosis), label.y = 4.4) +
  stat_poly_eq(
    aes(color = diagnosis, label = ..eq.label..),
    formula = y ~ x, label.y = 4.2, parse = TRUE) +
  theme_bw() +
  ggtitle("Correlation of texture variables") +
  theme(plot.title = element_text(hjust = 0.5))

Correlation analysis: The analysis showed a positive, very strong (r = 0.9120446), statistically significant (p-value < 2.2e-16) correlation between the texture_mean and texture_worst variables.

A point-biserial correlation, used to measure the strength and direction of the association between continuous and binary variables, was carried out in order to verify the correlation between the texture feature and the diagnosis (benign or malignant).

b1 <- biserial.cor(df$texture_mean, df$diagnosis, level = 2) # Level 2 = the malignant breast cancer group
cat("Correlation value (r): ", b1, "moderate")
## Correlation value (r):  0.4151853 moderate
b2 <- biserial.cor(df$texture_se, df$diagnosis, level = 2)
cat("Correlation value (r): ", b2, "very weak")
## Correlation value (r):  -0.008303333 very weak
b3 <- biserial.cor(df$texture_worst, df$diagnosis, level = 2)
cat("Correlation value (r): ", b3, "moderate")
## Correlation value (r):  0.4569028 moderate

Identifying extreme values: A commonly used rule (Tukey's rule) treats as outliers (extreme values, in this case) the observations more than 1.5 times the interquartile range (IQR) beyond the quartiles, i.e., below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. We quantified the outliers to better understand and characterize the data distribution and to improve the interpretation of the results, since extreme values could bias the statistical inferences and the predictive models.

out_1 <- which(df$texture_mean %in% boxplot(df$texture_mean, plot=FALSE)$out)
n.out_1 <- length(out_1)
cat("Number of Extreme Values:", n.out_1)
## Number of Extreme Values: 7
df[as.numeric(out_1),c("id", "diagnosis", "texture_mean")]
##           id diagnosis texture_mean
## 220 88119002         M        32.47
## 233 88203002         B        33.81
## 240 88330202         M        39.28
## 260 88725602         M        33.56
## 266 88995002         M        31.12
## 456  9112085         B        30.72
## 563   925622         M        30.62
out_2 <- which(df$texture_se %in% boxplot(df$texture_se, plot=FALSE)$out)
n.out_2 <- length(out_2)
cat("Number of Extreme Values:", n.out_2)
## Number of Extreme Values: 20
df[as.numeric(out_2),c("id", "diagnosis", "texture_se")]
##          id diagnosis texture_se
## 13   846226         M      3.568
## 84  8611792         M      2.910
## 123  865423         M      3.120
## 137  868223         B      2.508
## 153 8710441         B      2.664
## 193  875099         B      4.885
## 246  884437         B      2.612
## 259  887181         M      2.454
## 315  894047         B      2.777
## 346  898677         B      2.509
## 390   90312         M      2.836
## 417  905978         B      2.878
## 444  909777         B      2.542
## 472 9113816         B      2.643
## 474 9113846         B      3.647
## 529  918192         B      2.635
## 558  925236         B      2.927
## 560  925291         B      2.904
## 562  925311         B      3.896
## 566  926682         M      2.463
out_3 <- which(df$texture_worst %in% boxplot(df$texture_worst, plot=FALSE)$out)
n.out_3 <- length(out_3)
cat("Number of Extreme Values:", n.out_3)
## Number of Extreme Values: 5
df[as.numeric(out_3),c("id", "diagnosis", "texture_worst")]
##           id diagnosis texture_worst
## 220 88119002         M         45.41
## 240 88330202         M         44.87
## 260 88725602         M         49.54
## 266 88995002         M         47.16
## 563   925622         M         42.79

3. perimeter

perimeter <- df %>%
 dplyr::select(c(diagnosis, perimeter_mean, perimeter_se, perimeter_worst)) %>%
 group_by(diagnosis) %>%
 summarise(Mean_perimeter_mean = mean(perimeter_mean), Mean_perimeter_se = mean(perimeter_se), Mean_perimeter_worst = mean(perimeter_worst))

formattable(perimeter, list(
 diagnosis = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
 Mean_perimeter_mean = color_tile("#f7d383", "#fec306"),
 Mean_perimeter_se = color_tile("#eb724d", "#df5227"),
 Mean_perimeter_worst = color_tile("#b8ddf2", "#56B4E9")))
diagnosis   Mean_perimeter_mean   Mean_perimeter_se   Mean_perimeter_worst
B                      78.07541            2.000321               87.00594
M                     115.36538            4.323929              141.37033

The means of the perimeter variables (mean, se, worst) are higher in the malignant breast cancer group than in the benign breast cancer group.

test.m <- melt(df,id.vars='diagnosis', measure.vars=c('perimeter_mean','perimeter_se','perimeter_worst'))

ggplot(test.m, aes(x=diagnosis, y=value, fill=variable)) +
  geom_boxplot(alpha = 2/3) +
  labs(x = 'diagnosis') +
  scale_fill_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
  scale_color_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
  theme_bw() + ggtitle("diagnosis x perimeter variables") +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(~variable) +
  geom_jitter(alpha = I(1/4), aes(color = variable)) +
  stat_summary(fun = mean, geom = "text", size = 3, vjust = -3, aes(label = round(after_stat(y), digits = 2)))

Higher variability/spread for perimeter variables (mean, se, worst) was observed in the malignant breast cancer group.

ggplot(test.m, aes(x=value)) +
  geom_histogram(binwidth=10, aes(y=after_stat(density)), position="identity", alpha=0.7, color="black") +
  geom_density(alpha=0.4, color = NA) +
  labs(x = "", y = "Density", title = 'Distribution of the perimeter variables') + theme_bw() +
  aes(fill = variable) +
  scale_fill_manual(values = c("#fec306", "#df5227", "#56B4E9")) +  
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(~variable) +
  ylim(0, 0.15)

shapiro.tests <- t(as.data.frame(lapply(df[,c("perimeter_mean", "perimeter_se", "perimeter_worst")], function(x) shapiro.test(x)$p.value)))
colnames(shapiro.tests) <- "p-value"
as.data.frame(shapiro.tests)
##                      p-value
## perimeter_mean  7.011402e-15
## perimeter_se    7.587488e-30
## perimeter_worst 1.373336e-17

Normal distribution verification: The Shapiro-Wilk test and the shape of the histograms confirmed that the perimeter variables (mean, se, worst) are not normally distributed, so we applied a non-parametric test: the unpaired two-samples Wilcoxon test (also known as the Mann-Whitney test).

wilcox.tests <- t(as.data.frame(lapply(df[,c("perimeter_mean", "perimeter_se", "perimeter_worst")], function(x) wilcox.test(x ~ df$diagnosis, conf.level = 0.99)$p.value)))
colnames(wilcox.tests) <- "p-value"
as.data.frame(wilcox.tests)
##                      p-value
## perimeter_mean  3.553870e-71
## perimeter_se    5.099437e-51
## perimeter_worst 2.583004e-80

Wilcoxon test results: The p-values are < 0.01. Hence, we reject the null hypothesis: there are significant differences in all perimeter variables (mean, se, worst) between the groups.

The malignant breast cancer group has higher perimeter values than the benign group.

cor.test(df$perimeter_mean, df$perimeter_worst)
## 
##  Pearson's product-moment correlation
## 
## data:  df$perimeter_mean and df$perimeter_worst
## t = 95.657, df = 567, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9651750 0.9748288
## sample estimates:
##       cor 
## 0.9703869
ggplot(df, aes(perimeter_mean, perimeter_worst)) +
  geom_point(aes(color = diagnosis), size = 1, alpha = 0.4) +
  scale_color_manual(values = c("#f69400", "#838383")) +
  scale_fill_manual(values = c("#f69400", "#838383")) +
  facet_wrap(~diagnosis) +
  stat_smooth( aes(color = diagnosis, fill = diagnosis), method = "lm") +
  stat_cor(aes(color = diagnosis), label.y = 4.4) +
  stat_poly_eq(
    aes(color = diagnosis, label = ..eq.label..),
    formula = y ~ x, label.y = 4.2, parse = TRUE) +
  theme_bw() +
  ggtitle("Correlation of perimeter variables") +
  theme(plot.title = element_text(hjust = 0.5))

Correlation analysis: The analysis showed a positive, very strong (r = 0.9703869), statistically significant (p-value < 2.2e-16) correlation between the perimeter_mean and perimeter_worst variables.

A point-biserial correlation, used to measure the strength and direction of the association between continuous and binary variables, was carried out in order to verify the correlation between the perimeter feature and the diagnosis (benign or malignant).

b1 <- biserial.cor(df$perimeter_mean, df$diagnosis, level = 2) # Level 2 = the malignant breast cancer group
cat("Correlation value (r): ", b1, "strong")
## Correlation value (r):  0.7426355 strong
b2 <- biserial.cor(df$perimeter_se, df$diagnosis, level = 2)
cat("Correlation value (r): ", b2, "moderate")
## Correlation value (r):  0.5561407 moderate
b3 <- biserial.cor(df$perimeter_worst, df$diagnosis, level = 2)
cat("Correlation value (r): ", b3, "strong")
## Correlation value (r):  0.7829141 strong

Identifying extreme values: A commonly used rule (Tukey's rule) treats as outliers (extreme values, in this case) the observations more than 1.5 times the interquartile range (IQR) beyond the quartiles, i.e., below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. We quantified the outliers to better understand and characterize the data distribution and to improve the interpretation of the results, since extreme values could bias the statistical inferences and the predictive models.

out_1 <- which(df$perimeter_mean %in% boxplot(df$perimeter_mean, plot=FALSE)$out)
n.out_1 <- length(out_1)
cat("Number of Extreme Values:", n.out_1)
## Number of Extreme Values: 13
df[as.numeric(out_1),c("id", "diagnosis", "perimeter_mean")]
##            id diagnosis perimeter_mean
## 83    8611555         M          171.5
## 109     86355         M          152.8
## 123    865423         M          166.2
## 165   8712289         M          152.1
## 181    873592         M          182.1
## 203    878796         M          158.9
## 213   8810703         M          188.5
## 237  88299702         M          153.5
## 340     89812         M          155.1
## 353    899987         M          174.2
## 462 911296202         M          186.9
## 504    915143         M          152.1
## 522  91762702         M          165.5
out_2 <- which(df$perimeter_se %in% boxplot(df$perimeter_se, plot=FALSE)$out)
n.out_2 <- length(out_2)
cat("Number of Extreme Values:", n.out_2)
## Number of Extreme Values: 38
df[as.numeric(out_2),c("id", "diagnosis", "perimeter_se")]
##            id diagnosis perimeter_se
## 1      842302         M        8.589
## 13     846226         M       11.070
## 26     852631         M        7.276
## 39     855133         M        8.077
## 43     855625         M        8.830
## 78    8610637         M        6.311
## 79    8610862         M        8.649
## 83    8611555         M        7.382
## 109     86355         M       10.050
## 123    865423         M        9.807
## 139    868826         M        8.419
## 162   8711803         M        6.971
## 169   8712766         M        7.337
## 211 881046502         M        7.029
## 213   8810703         M       21.980
## 219   8811842         M        6.487
## 237  88299702         M        7.247
## 251    884948         M        6.372
## 257  88649001         M        7.158
## 259    887181         M       10.120
## 263    888570         M        6.146
## 266  88995002         M        7.749
## 273   8910988         M        8.867
## 301    892438         M        7.237
## 303  89263202         M        7.804
## 336  89742801         M        6.076
## 340     89812         M        6.462
## 353    899987         M        7.222
## 367   9011494         M        7.128
## 369   9011971         M        7.733
## 370   9012000         M        7.561
## 418  90602302         M        9.424
## 461 911296201         M        6.051
## 462 911296202         M       18.650
## 504    915143         M        9.635
## 522  91762702         M        7.050
## 564    926125         M        8.758
## 565    926424         M        7.673
out_3 <- which(df$perimeter_worst %in% boxplot(df$perimeter_worst, plot=FALSE)$out)
n.out_3 <- length(out_3)
cat("Number of Extreme Values:", n.out_3)
## Number of Extreme Values: 15
df[as.numeric(out_3),c("id", "diagnosis", "perimeter_worst")]
##            id diagnosis perimeter_worst
## 24     851509         M           188.0
## 83    8611555         M           211.7
## 109     86355         M           206.8
## 181    873592         M           220.8
## 213   8810703         M           188.5
## 237  88299702         M           206.0
## 266  88995002         M           214.0
## 273   8910988         M           195.9
## 340     89812         M           202.4
## 353    899987         M           229.3
## 369   9011971         M           199.5
## 370   9012000         M           195.0
## 462 911296202         M           251.2
## 504    915143         M           211.5
## 522  91762702         M           205.7

4. area

area <- df %>%
 dplyr::select(c(diagnosis, area_mean, area_se, area_worst)) %>%
 group_by(diagnosis) %>%
 summarise(Mean_area_mean = mean(area_mean), Mean_area_se = mean(area_se), Mean_area_worst = mean(area_worst))

formattable(area, list(
 diagnosis = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
 Mean_area_mean = color_tile("#f7d383", "#fec306"),
 Mean_area_se = color_tile("#eb724d", "#df5227"),
 Mean_area_worst = color_tile("#b8ddf2", "#56B4E9")))
diagnosis  Mean_area_mean  Mean_area_se  Mean_area_worst
B                462.7902      21.13515         558.8994
M                978.3764      72.67241        1422.2863

The means of the area variables (mean, se, worst) are higher in the malignant breast cancer group than in the benign breast cancer group.
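For readers not using the tidyverse, the same group means can be computed in base R with `aggregate()`; a sketch on toy data (not the real df):

```r
# Base-R equivalent of the grouped means above, on a toy stand-in for df
toy <- data.frame(
  diagnosis  = rep(c("B", "M"), each = 3),
  area_mean  = c(400, 450, 500, 900, 1000, 1100),
  area_worst = c(500, 560, 600, 1300, 1450, 1500)
)

# one row per diagnosis group, mean of each selected column
means <- aggregate(cbind(area_mean, area_worst) ~ diagnosis, data = toy, FUN = mean)
means
```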

test.m <- melt(df,id.vars='diagnosis', measure.vars=c('area_mean','area_se','area_worst'))

ggplot(test.m, aes(x=diagnosis, y=value, fill=variable)) +
  geom_boxplot(alpha = 2/3) +
  labs(x = 'diagnosis') +
  scale_fill_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
  scale_color_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
  theme_bw() + ggtitle("diagnosis x area variables") +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(~variable) +
  geom_jitter(alpha = I(1/4), aes(color = variable)) +
  stat_summary(fun = mean, geom = "text", size = 3, vjust = -3, aes(label = round(after_stat(y), digits = 2)))

Higher variability/spread for area variables (mean, se, worst) was observed in the malignant breast cancer group.

ggplot(test.m, aes(x=value)) +
  geom_histogram(binwidth = 170, aes(y = after_stat(density)), position = "identity", alpha = 0.7, color = "black") +
  geom_density(alpha = 0.4, color = NA) +
  labs(x = "", y = "Density", title = "Distribution of the area variables") + theme_bw() +
  aes(fill = variable) +
  scale_fill_manual(values = c("#fec306", "#df5227", "#56B4E9")) +  
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(~variable) +
  ylim(0, 0.015)

shapiro.tests <- t(as.data.frame(lapply(df[,c("area_mean", "area_se", "area_worst")], function(x) shapiro.test(x)$p.value)))
colnames(shapiro.tests) <- "p-value"
as.data.frame(shapiro.tests)
##                 p-value
## area_mean  3.196264e-22
## area_se    2.652703e-35
## area_worst 5.595364e-25

Normal distribution verification: The Shapiro-Wilk test and the shape of the histograms confirmed that the area variables (mean, se, worst) are not normally distributed, so we applied a non-parametric test: the unpaired two-samples Wilcoxon test (also known as the Mann-Whitney test).
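As a minimal illustration of this check on synthetic data (not the area variables), a clearly skewed sample is rejected by the Shapiro-Wilk test:

```r
# Shapiro-Wilk normality check on toy data
set.seed(3)
skewed <- rexp(200)                       # strongly right-skewed sample
p_skew <- shapiro.test(skewed)$p.value
p_skew < 0.05                             # TRUE: normality clearly rejected

normal <- rnorm(200)                      # compare with a normal sample
p_norm <- shapiro.test(normal)$p.value    # typically much larger
```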

wilcox.tests <- t(as.data.frame(lapply(df[,c("area_mean", "area_se", "area_worst")], function(x) wilcox.test(x ~ df$diagnosis, conf.level = 0.99)$p.value)))
colnames(wilcox.tests) <- "p-value"
as.data.frame(wilcox.tests)
##                 p-value
## area_mean  1.539780e-68
## area_se    5.767823e-65
## area_worst 1.803309e-78

Wilcoxon test results: The p-values are < 0.01. Hence, we reject the null hypothesis: there are significant differences between the groups for all area variables (mean, se, worst).

The malignant breast cancer group has higher area values than the benign group.
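The test itself can be sketched on synthetic data with a location shift (group means loosely inspired by the area means above; not the real measurements):

```r
# Unpaired two-sample Wilcoxon (Mann-Whitney) test on toy data
set.seed(7)
benign_like    <- rnorm(60, mean = 460, sd = 130)
malignant_like <- rnorm(40, mean = 980, sd = 350)

wt <- wilcox.test(benign_like, malignant_like)
wt$p.value < 0.01   # TRUE: the location shift is detected
```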

cor.test(df$area_mean, df$area_worst)
## 
##  Pearson's product-moment correlation
## 
## data:  df$area_mean and df$area_worst
## t = 80.799, df = 567, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9520827 0.9653017
## sample estimates:
##       cor 
## 0.9592133
ggplot(df, aes(area_mean, area_worst)) +
  geom_point(aes(color = diagnosis), size = 1, alpha = 0.4) +
  scale_color_manual(values = c("#f69400", "#838383")) +
  scale_fill_manual(values = c("#f69400", "#838383")) +
  facet_wrap(~diagnosis) +
  stat_smooth( aes(color = diagnosis, fill = diagnosis), method = "lm") +
  stat_cor(aes(color = diagnosis), label.y = 4.4) +
  stat_poly_eq(
    aes(color = diagnosis, label = ..eq.label..),
    formula = y ~ x, label.y = 4.2, parse = TRUE) +
  theme_bw() +
  ggtitle("Correlation of area variables") +
  theme(plot.title = element_text(hjust = 0.5))

Correlation analysis: The analysis showed a very strong (r = 0.9592133), positive, and statistically significant (p-value < 2.2e-16) correlation between the area_mean and area_worst variables.
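The quantities quoted here (r, the 95% confidence interval, the p-value) can be extracted programmatically from the htest object returned by cor.test(); a sketch on synthetic data:

```r
# Extracting the pieces of a cor.test() result (toy data, not area_mean/area_worst)
set.seed(11)
a <- rnorm(100)
b <- a + rnorm(100, sd = 0.3)   # strongly correlated with a by construction

ct <- cor.test(a, b)
r  <- unname(ct$estimate)       # sample correlation coefficient
ci <- ct$conf.int               # 95% confidence interval for the correlation
p  <- ct$p.value                # p-value of the test
```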

A point-biserial correlation, which measures the strength and direction of the association between a continuous and a binary variable, was carried out to assess the correlation between the area feature and the diagnosis (benign or malignant).

b1 <- biserial.cor(df$area_mean, df$diagnosis, level = 2) # Level 2 = the malignant breast cancer group
cat("Correlation value (r): ", b1, "strong")
## Correlation value (r):  0.7089838 strong
b2 <- biserial.cor(df$area_se, df$diagnosis, level = 2)
cat("Correlation value (r): ", b2, "moderate")
## Correlation value (r):  0.5482359 moderate
b3 <- biserial.cor(df$area_worst, df$diagnosis, level = 2)
cat("Correlation value (r): ", b3, "strong")
## Correlation value (r):  0.733825 strong

Identifying extreme values: A commonly used rule (Tukey’s rule) flags as outliers (extreme values, in this case) observations more than 1.5 times the interquartile range beyond the quartiles, i.e., below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. We quantified these outliers to better understand and characterize the data distribution and to improve the interpretation of the results, since extreme values can bias statistical inferences and prediction models.

out_1 <- which(df$area_mean %in% boxplot(df$area_mean, plot=FALSE)$out)
n.out_1 <- length(out_1)
cat("Number of Extreme Values:", n.out_1)
## Number of Extreme Values: 25
df[as.numeric(out_1),c("id", "diagnosis", "area_mean")]
##            id diagnosis area_mean
## 24     851509         M      1404
## 83    8611555         M      1878
## 109     86355         M      1509
## 123    865423         M      1761
## 165   8712289         M      1686
## 181    873592         M      2250
## 203    878796         M      1685
## 213   8810703         M      2499
## 237  88299702         M      1670
## 251    884948         M      1364
## 266  88995002         M      1419
## 273   8910988         M      1491
## 340     89812         M      1747
## 353    899987         M      2010
## 369   9011971         M      1546
## 370   9012000         M      1482
## 373   9012795         M      1386
## 374    901288         M      1335
## 394    903516         M      1407
## 450 911157302         M      1384
## 462 911296202         M      2501
## 504    915143         M      1682
## 522  91762702         M      1841
## 564    926125         M      1347
## 565    926424         M      1479
out_2 <- which(df$area_se %in% boxplot(df$area_se, plot=FALSE)$out)
n.out_2 <- length(out_2)
cat("Number of Extreme Values:", n.out_2)
## Number of Extreme Values: 65
df[as.numeric(out_2),c("id", "diagnosis", "area_se")]
##            id diagnosis area_se
## 1      842302         M  153.40
## 3    84300903         M   94.03
## 5    84358402         M   94.44
## 13     846226         M  116.20
## 19     849014         M  112.40
## 24     851509         M   93.99
## 25     852552         M  102.60
## 26     852631         M  111.40
## 28     852781         M   93.54
## 31     853401         M  105.00
## 39     855133         M  106.00
## 43     855625         M  104.90
## 54     857392         M   98.81
## 57     857637         M  102.50
## 71     859575         M   96.05
## 78    8610637         M  134.80
## 79    8610862         M  116.40
## 83    8611555         M  120.00
## 96      86208         M   87.87
## 109     86355         M  170.00
## 122     86517         M   90.47
## 123    865423         M  233.00
## 139    868826         M  101.90
## 157   8711202         M   93.91
## 162   8711803         M  119.30
## 163    871201         M   97.07
## 165   8712289         M   97.85
## 169   8712766         M  122.30
## 181    873592         M  128.70
## 211 881046502         M  111.70
## 213   8810703         M  525.60
## 219   8811842         M  124.40
## 220  88119002         M  109.90
## 237  88299702         M  155.80
## 251    884948         M  137.90
## 253    885429         M   92.81
## 257  88649001         M  106.40
## 259    887181         M  138.50
## 263    888570         M   90.94
## 266  88995002         M  199.70
## 273   8910988         M  156.80
## 301    892438         M  133.00
## 303  89263202         M  130.80
## 336  89742801         M   87.17
## 338    897630         M   88.25
## 340     89812         M  164.10
## 353    899987         M  153.10
## 367   9011494         M  103.60
## 369   9011971         M  224.10
## 370   9012000         M  130.20
## 418  90602302         M  176.50
## 434    908445         M  103.90
## 461 911296201         M  115.20
## 462 911296202         M  542.20
## 469   9113538         M  104.90
## 493    914062         M   89.74
## 499    914769         M   95.77
## 504    915143         M  180.20
## 522  91762702         M  139.90
## 534  91930402         M  100.40
## 536    919555         M   87.78
## 564    926125         M  118.80
## 565    926424         M  158.70
## 566    926682         M   99.04
## 568    927241         M   86.22
out_3 <- which(df$area_worst %in% boxplot(df$area_worst, plot=FALSE)$out)
n.out_3 <- length(out_3)
cat("Number of Extreme Values:", n.out_3)
## Number of Extreme Values: 35
df[as.numeric(out_3),c("id", "diagnosis", "area_worst")]
##            id diagnosis area_worst
## 1      842302         M       2019
## 2      842517         M       1956
## 19     849014         M       2398
## 24     851509         M       2615
## 25     852552         M       2215
## 57     857637         M       2145
## 83    8611555         M       2562
## 109     86355         M       2360
## 123    865423         M       2073
## 163    871201         M       2232
## 165   8712289         M       2403
## 181    873592         M       3216
## 182    873593         M       2089
## 203    878796         M       1986
## 213   8810703         M       2499
## 219   8811842         M       2009
## 220  88119002         M       2477
## 237  88299702         M       2944
## 251    884948         M       2010
## 255    886226         M       1972
## 266  88995002         M       3432
## 273   8910988         M       2384
## 301    892438         M       2053
## 324    895100         M       1938
## 340     89812         M       2906
## 353    899987         M       3234
## 369   9011971         M       3143
## 370   9012000         M       2227
## 374    901288         M       1946
## 394    903516         M       2081
## 450 911157302         M       2022
## 462 911296202         M       4254
## 504    915143         M       2782
## 522  91762702         M       2642
## 565    926424         M       2027

5. smoothness

smoothness <- df %>%
 dplyr::select(c(diagnosis, smoothness_mean, smoothness_se, smoothness_worst)) %>%
 group_by(diagnosis) %>%
 summarise(Mean_smoothness_mean = mean(smoothness_mean), Mean_smoothness_se = mean(smoothness_se), Mean_smoothness_worst = mean(smoothness_worst))

formattable(smoothness, list(
 diagnosis = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
 Mean_smoothness_mean = color_tile("#f7d383", "#fec306"),
 Mean_smoothness_se = color_tile("#eb724d", "#df5227"),
 Mean_smoothness_worst = color_tile("#b8ddf2", "#56B4E9")))
diagnosis  Mean_smoothness_mean  Mean_smoothness_se  Mean_smoothness_worst
B                    0.09247765         0.007195902              0.1249595
M                    0.10289849         0.006780094              0.1448452

The means of the smoothness variables mean and worst are higher in the malignant breast cancer group than in the benign group; smoothness_se is slightly lower in the malignant group.

test.m <- melt(df,id.vars='diagnosis', measure.vars=c('smoothness_mean','smoothness_se','smoothness_worst'))

ggplot(test.m, aes(x=diagnosis, y=value, fill=variable)) +
  geom_boxplot(alpha = 2/3) +
  labs(x = 'diagnosis') +
  scale_fill_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
  scale_color_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
  theme_bw() + ggtitle("diagnosis x smoothness variables") +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(~variable) +
  geom_jitter(alpha = I(1/4), aes(color = variable)) +
  stat_summary(fun = mean, geom = "text", size = 3, vjust = -3, aes(label = round(after_stat(y), digits = 2)))

The variability/spread for smoothness variables (mean, se, worst) seems to be similar between the groups.

ggplot(test.m, aes(x=value)) +
  geom_histogram(binwidth = 0.001, aes(y = after_stat(density)), position = "identity", alpha = 0.7, color = "black") +
  geom_density(alpha = 0.4, color = NA) +
  labs(x = "", y = "Density", title = "Distribution of the smoothness variables") + theme_bw() +
  aes(fill = variable) +
  scale_fill_manual(values = c("#fec306", "#df5227", "#56B4E9")) +  
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(~variable) +
  ylim(0, 0.6)

shapiro.tests <- t(as.data.frame(lapply(df[,c("smoothness_mean", "smoothness_se", "smoothness_worst")], function(x) shapiro.test(x)$p.value)))
colnames(shapiro.tests) <- "p-value"
as.data.frame(shapiro.tests)
##                       p-value
## smoothness_mean  8.600833e-05
## smoothness_se    1.361967e-23
## smoothness_worst 2.096993e-04

Normal distribution verification: The Shapiro-Wilk test and the shape of the histograms confirmed that the smoothness variables (mean, se, worst) are not normally distributed, so we applied a non-parametric test: the unpaired two-samples Wilcoxon test (also known as the Mann-Whitney test).

wilcox.tests <- t(as.data.frame(lapply(df[,c("smoothness_mean", "smoothness_se", "smoothness_worst")], function(x) wilcox.test(x ~ df$diagnosis, conf.level = 0.99)$p.value)))
colnames(wilcox.tests) <- "p-value"
as.data.frame(wilcox.tests)
##                       p-value
## smoothness_mean  7.793007e-19
## smoothness_se    2.136316e-01
## smoothness_worst 3.637942e-24

Wilcoxon test results: The p-values are < 0.01 for 2 of the 3 smoothness variables. Hence, we reject the null hypothesis for those two: there are significant differences between the groups for the smoothness variables mean and worst, but not for smoothness_se.

The malignant breast cancer group has higher smoothness values (local variation in radius lengths) than the benign group.

cor.test(df$smoothness_mean, df$smoothness_worst)
## 
##  Pearson's product-moment correlation
## 
## data:  df$smoothness_mean and df$smoothness_worst
## t = 32.347, df = 567, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7743878 0.8324192
## sample estimates:
##       cor 
## 0.8053242
ggplot(df, aes(smoothness_mean, smoothness_worst)) +
  geom_point(aes(color = diagnosis), size = 1, alpha = 0.4) +
  scale_color_manual(values = c("#f69400", "#838383")) +
  scale_fill_manual(values = c("#f69400", "#838383")) +
  facet_wrap(~diagnosis) +
  stat_smooth( aes(color = diagnosis, fill = diagnosis), method = "lm") +
  stat_cor(aes(color = diagnosis), label.y = 4.4) +
  stat_poly_eq(
    aes(color = diagnosis, label = ..eq.label..),
    formula = y ~ x, label.y = 4.2, parse = TRUE) +
  theme_bw() +
  ggtitle("Correlation of smoothness variables") +
  theme(plot.title = element_text(hjust = 0.5))

Correlation analysis: The analysis showed a strong (r = 0.8053242), positive, and statistically significant (p-value < 2.2e-16) correlation between the smoothness_mean and smoothness_worst variables.

A point-biserial correlation, which measures the strength and direction of the association between a continuous and a binary variable, was carried out to assess the correlation between the smoothness feature and the diagnosis (benign or malignant).

b1 <- biserial.cor(df$smoothness_mean, df$diagnosis, level = 2) # Level 2 = the malignant breast cancer group
cat("Correlation value (r): ", b1, "weak")
## Correlation value (r):  0.35856 weak
b2 <- biserial.cor(df$smoothness_se, df$diagnosis, level = 2)
cat("Correlation value (r): ", b2, "very weak")
## Correlation value (r):  -0.06701601 very weak
b3 <- biserial.cor(df$smoothness_worst, df$diagnosis, level = 2)
cat("Correlation value (r): ", b3, "moderate")
## Correlation value (r):  0.4214649 moderate

Identifying extreme values: A commonly used rule (Tukey’s rule) flags as outliers (extreme values, in this case) observations more than 1.5 times the interquartile range beyond the quartiles, i.e., below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. We quantified these outliers to better understand and characterize the data distribution and to improve the interpretation of the results, since extreme values can bias statistical inferences and prediction models.

out_1 <- which(df$smoothness_mean %in% boxplot(df$smoothness_mean, plot=FALSE)$out)
n.out_1 <- length(out_1)
cat("Number of Extreme Values:", n.out_1)
## Number of Extreme Values: 6
df[as.numeric(out_1),c("id", "diagnosis", "smoothness_mean")]
##           id diagnosis smoothness_mean
## 4   84348301         M         0.14250
## 106   863030         M         0.13980
## 123   865423         M         0.14470
## 505   915186         B         0.16340
## 521   917092         B         0.13710
## 569    92751         B         0.05263
out_2 <- which(df$smoothness_se %in% boxplot(df$smoothness_se, plot=FALSE)$out)
n.out_2 <- length(out_2)
cat("Number of Extreme Values:", n.out_2)
## Number of Extreme Values: 30
df[as.numeric(out_2),c("id", "diagnosis", "smoothness_se")]
##            id diagnosis smoothness_se
## 72     859711         B       0.01721
## 77    8610629         B       0.01340
## 111    864033         B       0.01385
## 112     86408         B       0.01291
## 117    864726         B       0.01835
## 123    865423         M       0.02333
## 174    871641         B       0.01496
## 177    872608         B       0.01286
## 186    874158         B       0.01439
## 197    875938         M       0.01380
## 213   8810703         M       0.01345
## 214 881094802         M       0.03113
## 246    884437         B       0.01604
## 274   8910996         B       0.01380
## 276   8911164         B       0.01418
## 289   8913049         B       0.01574
## 315    894047         B       0.02075
## 333    897132         B       0.01289
## 346    898677         B       0.01736
## 392    903483         B       0.01582
## 417    905978         B       0.01474
## 425    907145         B       0.01307
## 470    911366         B       0.01459
## 506    915276         B       0.02177
## 508  91544002         B       0.01262
## 521    917092         B       0.01546
## 538    919812         B       0.01288
## 539    921092         B       0.01266
## 540    921362         B       0.01547
## 557    924964         B       0.01291
out_3 <- which(df$smoothness_worst %in% boxplot(df$smoothness_worst, plot=FALSE)$out)
n.out_3 <- length(out_3)
cat("Number of Extreme Values:", n.out_3)
## Number of Extreme Values: 7
df[as.numeric(out_3),c("id", "diagnosis", "smoothness_worst")]
##           id diagnosis smoothness_worst
## 4   84348301         M          0.20980
## 42    855563         M          0.19090
## 193   875099         B          0.07117
## 204    87880         M          0.22260
## 380  9013838         M          0.21840
## 505   915186         B          0.19020
## 506   915276         B          0.20060

6. compactness

compactness <- df %>%
 dplyr::select(c(diagnosis, compactness_mean, compactness_se, compactness_worst)) %>%
 group_by(diagnosis) %>%
 summarise(Mean_compactness_mean = mean(compactness_mean), Mean_compactness_se = mean(compactness_se), Mean_compactness_worst = mean(compactness_worst))

formattable(compactness, list(
 diagnosis = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
 Mean_compactness_mean = color_tile("#f7d383", "#fec306"),
 Mean_compactness_se = color_tile("#eb724d", "#df5227"),
 Mean_compactness_worst = color_tile("#b8ddf2", "#56B4E9")))
diagnosis  Mean_compactness_mean  Mean_compactness_se  Mean_compactness_worst
B                     0.08008462           0.02143825               0.1826725
M                     0.14518778           0.03228117               0.3748241

The means of the compactness variables (mean, se, worst) are higher in the malignant breast cancer group than in the benign breast cancer group.

test.m <- melt(df,id.vars='diagnosis', measure.vars=c('compactness_mean','compactness_se','compactness_worst'))

ggplot(test.m, aes(x=diagnosis, y=value, fill=variable)) +
  geom_boxplot(alpha = 2/3) +
  labs(x = 'diagnosis') +
  scale_fill_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
  scale_color_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
  theme_bw() + ggtitle("diagnosis x compactness variables") +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(~variable) +
  geom_jitter(alpha = I(1/4), aes(color = variable)) +
  stat_summary(fun = mean, geom = "text", size = 3, vjust = -3, aes(label = round(after_stat(y), digits = 2)))

Higher variability/spread for compactness variables (mean, se, worst) was observed in the malignant breast cancer group.

ggplot(test.m, aes(x=value)) +
  geom_histogram(binwidth = 0.05, aes(y = after_stat(density)), position = "identity", alpha = 0.7, color = "black") +
  geom_density(alpha = 0.4, color = NA) +
  labs(x = "", y = "Density", title = "Distribution of the compactness variables") + theme_bw() +
  aes(fill = variable) +
  scale_fill_manual(values = c("#fec306", "#df5227", "#56B4E9")) +  
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(~variable) +
  ylim(0, 0.5)

shapiro.tests <- t(as.data.frame(lapply(df[,c("compactness_mean", "compactness_se", "compactness_worst")], function(x) shapiro.test(x)$p.value)))
colnames(shapiro.tests) <- "p-value"
as.data.frame(shapiro.tests)
##                        p-value
## compactness_mean  3.967204e-17
## compactness_se    1.082957e-23
## compactness_worst 1.247461e-19

Normal distribution verification: The Shapiro-Wilk test and the shape of the histograms confirmed that the compactness variables (mean, se, worst) are not normally distributed, so we applied a non-parametric test: the unpaired two-samples Wilcoxon test (also known as the Mann-Whitney test).

wilcox.tests <- t(as.data.frame(lapply(df[,c("compactness_mean", "compactness_se", "compactness_worst")], function(x) wilcox.test(x ~ df$diagnosis, conf.level = 0.99)$p.value)))
colnames(wilcox.tests) <- "p-value"
as.data.frame(wilcox.tests)
##                        p-value
## compactness_mean  8.951992e-48
## compactness_se    1.168061e-19
## compactness_worst 2.115525e-47

Wilcoxon test results: The p-values are < 0.01. Hence, we reject the null hypothesis: there are significant differences between the groups for all compactness variables (mean, se, worst).

The malignant breast cancer group has higher compactness values (perimeter^2 / area - 1.0) than the benign group.

cor.test(df$compactness_mean, df$compactness_worst)
## 
##  Pearson's product-moment correlation
## 
## data:  df$compactness_mean and df$compactness_worst
## t = 41.202, df = 567, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8436520 0.8850219
## sample estimates:
##      cor 
## 0.865809
ggplot(df, aes(compactness_mean, compactness_worst)) +
  geom_point(aes(color = diagnosis), size = 1, alpha = 0.4) +
  scale_color_manual(values = c("#f69400", "#838383")) +
  scale_fill_manual(values = c("#f69400", "#838383")) +
  facet_wrap(~diagnosis) +
  stat_smooth( aes(color = diagnosis, fill = diagnosis), method = "lm") +
  stat_cor(aes(color = diagnosis), label.y = 4.4) +
  stat_poly_eq(
    aes(color = diagnosis, label = ..eq.label..),
    formula = y ~ x, label.y = 4.2, parse = TRUE) +
  theme_bw() +
  ggtitle("Correlation of compactness variables") +
  theme(plot.title = element_text(hjust = 0.5))

Correlation analysis: The analysis showed a very strong (r = 0.865809), positive, and statistically significant (p-value < 2.2e-16) correlation between the compactness_mean and compactness_worst variables.

A point-biserial correlation, which measures the strength and direction of the association between a continuous and a binary variable, was carried out to assess the correlation between the compactness feature and the diagnosis (benign or malignant).

b1 <- biserial.cor(df$compactness_mean, df$diagnosis, level = 2) # Level 2 = the malignant breast cancer group
cat("Correlation value (r): ", b1, "moderate")
## Correlation value (r):  0.5965337 moderate
b2 <- biserial.cor(df$compactness_se, df$diagnosis, level = 2)
cat("Correlation value (r): ", b2, "weak")
## Correlation value (r):  0.2929992 weak
b3 <- biserial.cor(df$compactness_worst, df$diagnosis, level = 2)
cat("Correlation value (r): ", b3, "moderate")
## Correlation value (r):  0.5909982 moderate

Identifying extreme values: A commonly used rule (Tukey’s rule) flags as outliers (extreme values, in this case) observations more than 1.5 times the interquartile range beyond the quartiles, i.e., below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. We quantified these outliers to better understand and characterize the data distribution and to improve the interpretation of the results, since extreme values can bias statistical inferences and prediction models.

out_1 <- which(df$compactness_mean %in% boxplot(df$compactness_mean, plot=FALSE)$out)
n.out_1 <- length(out_1)
cat("Number of Extreme Values:", n.out_1)
## Number of Extreme Values: 16
df[as.numeric(out_1),c("id", "diagnosis", "compactness_mean")]
##           id diagnosis compactness_mean
## 1     842302         M           0.2776
## 4   84348301         M           0.2839
## 10  84501001         M           0.2396
## 13    846226         M           0.2458
## 15  84667401         M           0.2293
## 79   8610862         M           0.3454
## 83   8611555         M           0.2665
## 109    86355         M           0.2768
## 123   865423         M           0.2867
## 182   873593         M           0.2832
## 191   874858         M           0.2413
## 259   887181         M           0.3114
## 352   899667         M           0.2364
## 353   899987         M           0.2363
## 401 90439701         M           0.2576
## 568   927241         M           0.2770
out_2 <- which(df$compactness_se %in% boxplot(df$compactness_se, plot=FALSE)$out)
n.out_2 <- length(out_2)
cat("Number of Extreme Values:", n.out_2)
## Number of Extreme Values: 28
df[as.numeric(out_2),c("id", "diagnosis", "compactness_se")]
##            id diagnosis compactness_se
## 4    84348301         M        0.07458
## 10   84501001         M        0.07217
## 13     846226         M        0.08297
## 43     855625         M        0.10060
## 63     858986         M        0.07056
## 69     859471         B        0.08606
## 72     859711         B        0.09368
## 79    8610862         M        0.06835
## 109     86355         M        0.08668
## 113     86409         B        0.07446
## 117    864726         B        0.06760
## 123    865423         M        0.09806
## 153   8710441         B        0.09586
## 177    872608         B        0.08808
## 191    874858         M        0.13540
## 214 881094802         M        0.08555
## 289   8913049         B        0.08262
## 291  89143602         B        0.10640
## 319    894329         B        0.06590
## 352    899667         M        0.06559
## 377    901315         B        0.07643
## 389    903011         B        0.06669
## 431    907914         M        0.06213
## 466   9113239         B        0.06657
## 469   9113538         M        0.07025
## 486    913063         B        0.07471
## 540    921362         B        0.06457
## 568    927241         M        0.06158
out_3 <- which(df$compactness_worst %in% boxplot(df$compactness_worst, plot=FALSE)$out)
n.out_3 <- length(out_3)
cat("Number of Extreme Values:", n.out_3)
## Number of Extreme Values: 16
df[as.numeric(out_3),c("id", "diagnosis", "compactness_worst")]
##           id diagnosis compactness_worst
## 1     842302         M            0.6656
## 4   84348301         M            0.8663
## 10  84501001         M            1.0580
## 15  84667401         M            0.7725
## 16  84799002         M            0.6577
## 27    852763         M            0.6643
## 34    854002         M            0.6590
## 43    855625         M            0.7444
## 73    859717         M            0.7394
## 109    86355         M            0.6997
## 182   873593         M            0.7584
## 191   874858         M            0.9327
## 380  9013838         M            0.9379
## 431   907914         M            0.7090
## 563   925622         M            0.7917
## 568   927241         M            0.8681

7. concavity

concavity <- df %>%
 dplyr::select(c(diagnosis, concavity_mean, concavity_se, concavity_worst)) %>%
 group_by(diagnosis) %>%
 summarise(Mean_concavity_mean = mean(concavity_mean), Mean_concavity_se = mean(concavity_se), Mean_concavity_worst = mean(concavity_worst))

formattable(concavity, list(
 diagnosis = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
 Mean_concavity_mean = color_tile("#f7d383", "#fec306"),
 Mean_concavity_se = color_tile("#eb724d", "#df5227"),
 Mean_concavity_worst = color_tile("#b8ddf2", "#56B4E9")))
diagnosis  Mean_concavity_mean  Mean_concavity_se  Mean_concavity_worst
B          0.04605762           0.02599674         0.1662377
M          0.16077472           0.04182401         0.4506056

The means of the concavity variables (mean, se, worst) are higher in the malignant breast cancer group than in the benign group.

test.m <- melt(df,id.vars='diagnosis', measure.vars=c('concavity_mean','concavity_se','concavity_worst'))

ggplot(test.m, aes(x=diagnosis, y=value, fill=variable)) +
  geom_boxplot(alpha = 2/3) +
  labs(x = 'diagnosis') +
  scale_fill_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
  scale_color_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
  theme_bw() + ggtitle("diagnosis x concavity variables") +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(~variable) +
  geom_jitter(alpha = I(1/4), aes(color = variable)) +
  stat_summary(fun.y=mean, geom="text", size=3, vjust=-3, aes( label=round(..y.., digits=2)))

Higher variability/spread for concavity variables (mean, se, worst) was observed in the malignant breast cancer group.

ggplot(test.m, aes(x=value)) +
  geom_histogram(binwidth=0.05, aes(y=..density..), position="identity", alpha=0.7, color="black") +
  geom_density(alpha=0.4, color = NA) +
  labs(x = "", y = "Count", title = 'Distribution of the concavity variables') + theme_bw() +
  aes(fill = variable) +
  scale_fill_manual(values = c("#fec306", "#df5227", "#56B4E9")) +  
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(~variable) +
  ylim(0, 0.5)

shapiro.tests <- t(as.data.frame(lapply(df[,c("concavity_mean", "concavity_se", "concavity_worst")], function(x) shapiro.test(x)$p.value)))
colnames(shapiro.tests) <- "p-value"
as.data.frame(shapiro.tests)
##                      p-value
## concavity_mean  1.338571e-21
## concavity_se    1.101681e-31
## concavity_worst 4.543300e-17

Normal distribution verification: The Shapiro-Wilk test and the shape of the histograms confirmed that the concavity variables (mean, se, worst) do not follow a normal distribution, so we applied a non-parametric test: the unpaired two-samples Wilcoxon test (also known as the Mann-Whitney test).

wilcox.tests <- t(as.data.frame(lapply(df[,c("concavity_mean", "concavity_se", "concavity_worst")], function(x) wilcox.test(x ~ df$diagnosis, conf.level = 0.99)$p.value)))
colnames(wilcox.tests) <- "p-value"
as.data.frame(wilcox.tests)
##                      p-value
## concavity_mean  2.164549e-68
## concavity_se    3.675508e-29
## concavity_worst 1.761723e-63

Wilcoxon test results: The p-values are < 0.01. Hence, we reject the null hypothesis. There are significant differences for all concavity variables (mean, se, worst) between the groups.

The malignant breast cancer group shows higher values of the concavity feature (severity of concave portions of the contour) than the benign group.

cor.test(df$concavity_mean, df$concavity_worst)
## 
##  Pearson's product-moment correlation
## 
## data:  df$concavity_mean and df$concavity_worst
## t = 45.051, df = 567, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8647472 0.9008355
## sample estimates:
##       cor 
## 0.8841026
ggplot(df, aes(concavity_mean, concavity_worst)) +
  geom_point(aes(color = diagnosis), size = 1, alpha = 0.4) +
  scale_color_manual(values = c("#f69400", "#838383")) +
  scale_fill_manual(values = c("#f69400", "#838383")) +
  facet_wrap(~diagnosis) +
  stat_smooth( aes(color = diagnosis, fill = diagnosis), method = "lm") +
  stat_cor(aes(color = diagnosis), label.y = 4.4) +
  stat_poly_eq(
    aes(color = diagnosis, label = ..eq.label..),
    formula = y ~ x, label.y = 4.2, parse = TRUE) +
  theme_bw() +
  ggtitle("Correlation of concavity variables") +
  theme(plot.title = element_text(hjust = 0.5))

Correlation analysis: The analysis showed a positive, very strong (r = 0.8841) and statistically significant (p-value < 2.2e-16) correlation between the concavity_mean and concavity_worst variables.

A point-biserial correlation, which measures the strength and direction of the association between a continuous variable and a binary variable, was carried out to assess the correlation between the concavity feature and the diagnosis (benign or malignant).
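Under the hood, the point-biserial coefficient is simply the Pearson correlation computed after coding the binary variable as 0/1. A minimal sketch with toy vectors (`x` and `grp` are illustrative values, not taken from the dataset):

```r
# Point-biserial correlation: Pearson correlation between a continuous
# variable and a binary variable coded as 0/1 (here 1 = "M", malignant).
x   <- c(0.10, 0.25, 0.08, 0.30, 0.12, 0.28)  # toy continuous feature
grp <- c("B", "M", "B", "M", "B", "M")        # toy diagnosis labels

r_pb <- cor(x, as.numeric(grp == "M"))
round(r_pb, 4)
```

Because the "M" observations sit well above the "B" observations in this toy example, the coefficient comes out strongly positive; `biserial.cor()` from the ltm package wraps the same idea (its `level` argument selects which group is coded 1).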

b1 <- biserial.cor(df$concavity_mean, df$diagnosis, level = 2) # Level 2 = the malignant breast cancer group
cat("Correlation value (r): ", b1, "strong")
## Correlation value (r):  0.6963597 strong
b2 <- biserial.cor(df$concavity_se, df$diagnosis, level = 2)
cat("Correlation value (r): ", b2, "weak")
## Correlation value (r):  0.2537298 weak
b3 <- biserial.cor(df$concavity_worst, df$diagnosis, level = 2)
cat("Correlation value (r): ", b3, "strong")
## Correlation value (r):  0.6596102 strong

Identifying extreme values: A commonly used rule (Tukey’s rule) flags as outliers (extreme values, in this case) the observations that lie more than 1.5 times the interquartile range beyond the quartiles, i.e. below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. We quantified these outliers to better characterize the data distribution and to support the interpretation of the results, since extreme values can bias statistical inferences and predictive models.
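The `boxplot(..., plot = FALSE)$out` calls below apply this rule via the hinges computed by `boxplot.stats()`; the fences can also be computed explicitly. A minimal sketch with a toy vector `v` (illustrative values, not from the dataset):

```r
# Tukey's rule: flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
v <- c(2, 3, 3, 4, 4, 5, 5, 6, 30)               # toy data with one extreme value
q <- quantile(v, c(0.25, 0.75), names = FALSE)   # Q1 and Q3
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
v[v < lower | v > upper]                         # -> 30
```

Note that `boxplot.stats()` derives its fences from the hinges returned by `fivenum()`, which can differ slightly from `quantile()` on small samples, so the two approaches may occasionally disagree at the margin.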

out_1 <- which(df$concavity_mean %in% boxplot(df$concavity_mean, plot=FALSE)$out)
n.out_1 <- length(out_1)
cat("Number of Extreme Values:", n.out_1)
## Number of Extreme Values: 18
df[as.numeric(out_1),c("id", "diagnosis", "concavity_mean")]
##            id diagnosis concavity_mean
## 1      842302         M         0.3001
## 69     859471         B         0.3130
## 79    8610862         M         0.3754
## 83    8611555         M         0.3339
## 109     86355         M         0.4264
## 113     86409         B         0.3003
## 123    865423         M         0.4268
## 153   8710441         B         0.4108
## 181    873592         M         0.2871
## 203    878796         M         0.3523
## 213   8810703         M         0.3201
## 259    887181         M         0.3176
## 352    899667         M         0.2914
## 353    899987         M         0.3368
## 401  90439701         M         0.3189
## 462 911296202         M         0.3635
## 564    926125         M         0.3174
## 568    927241         M         0.3514
out_2 <- which(df$concavity_se %in% boxplot(df$concavity_se, plot=FALSE)$out)
n.out_2 <- length(out_2)
cat("Number of Extreme Values:", n.out_2)
## Number of Extreme Values: 22
df[as.numeric(out_2),c("id", "diagnosis", "concavity_se")]
##            id diagnosis concavity_se
## 13     846226         M      0.08890
## 43     855625         M      0.09723
## 69     859471         B      0.30380
## 79    8610862         M      0.10910
## 109     86355         M      0.10400
## 113     86409         B      0.14350
## 117    864726         B      0.09263
## 123    865423         M      0.12780
## 153   8710441         B      0.39600
## 177    872608         B      0.11970
## 191    874858         M      0.11660
## 203    878796         M      0.08958
## 214 881094802         M      0.14380
## 243    883852         B      0.08880
## 251    884948         M      0.09518
## 291  89143602         B      0.09960
## 319    894329         B      0.10270
## 352    899667         M      0.09953
## 377    901315         B      0.15350
## 389    903011         B      0.09472
## 486    913063         B      0.11140
## 540    921362         B      0.09252
out_3 <- which(df$concavity_worst %in% boxplot(df$concavity_worst, plot=FALSE)$out)
n.out_3 <- length(out_3)
cat("Number of Extreme Values:", n.out_3)
## Number of Extreme Values: 12
df[as.numeric(out_3),c("id", "diagnosis", "concavity_worst")]
##           id diagnosis concavity_worst
## 10  84501001         M          1.1050
## 69    859471         B          1.2520
## 109    86355         M          0.9608
## 153  8710441         B          0.8216
## 191   874858         M          0.8488
## 203   878796         M          0.7892
## 253   885429         M          0.8489
## 380  9013838         M          0.8402
## 401 90439701         M          0.9034
## 431   907914         M          0.9019
## 563   925622         M          1.1700
## 568   927241         M          0.9387

8. concave_points

concave_points <- df %>%
 dplyr::select(c(diagnosis, concave_points_mean, concave_points_se, concave_points_worst)) %>%
 group_by(diagnosis) %>%
 summarise(Mean_concave_points_mean = mean(concave_points_mean), Mean_concave_points_se = mean(concave_points_se), Mean_concave_points_worst = mean(concave_points_worst))

formattable(concave_points, list(
 diagnosis = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
 Mean_concave_points_mean = color_tile("#f7d383", "#fec306"),
 Mean_concave_points_se = color_tile("#eb724d", "#df5227"),
 Mean_concave_points_worst = color_tile("#b8ddf2", "#56B4E9")))
diagnosis  Mean_concave_points_mean  Mean_concave_points_se  Mean_concave_points_worst
B          0.02571741                0.009857653             0.07444434
M          0.08799000                0.015060472             0.18223731

The means of the concave_points variables (mean, se, worst) are higher in the malignant breast cancer group than in the benign group.

test.m <- melt(df,id.vars='diagnosis', measure.vars=c('concave_points_mean','concave_points_se','concave_points_worst'))

ggplot(test.m, aes(x=diagnosis, y=value, fill=variable)) +
  geom_boxplot(alpha = 2/3) +
  labs(x = 'diagnosis') +
  scale_fill_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
  scale_color_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
  theme_bw() + ggtitle("diagnosis x concave_points variables") +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(~variable) +
  geom_jitter(alpha = I(1/4), aes(color = variable)) +
  stat_summary(fun.y=mean, geom="text", size=3, vjust=-3, aes( label=round(..y.., digits=2)))

Higher variability/spread for concave_points variables (mean, se, worst) was observed in the malignant breast cancer group.

ggplot(test.m, aes(x=value)) +
  geom_histogram(binwidth=0.02, aes(y=..density..), position="identity", alpha=0.7, color="black") +
  geom_density(alpha=0.4, color = NA) +
  labs(x = "", y = "Count", title = 'Distribution of the concave_points variables') + theme_bw() +
  aes(fill = variable) +
  scale_fill_manual(values = c("#fec306", "#df5227", "#56B4E9")) +  
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(~variable) +
  ylim(0, 0.5)

shapiro.tests <- t(as.data.frame(lapply(df[,c("concave_points_mean", "concave_points_se", "concave_points_worst")], function(x) shapiro.test(x)$p.value)))
colnames(shapiro.tests) <- "p-value"
as.data.frame(shapiro.tests)
##                           p-value
## concave_points_mean  1.404556e-19
## concave_points_se    7.825998e-17
## concave_points_worst 1.984878e-10

Normal distribution verification: The Shapiro-Wilk test and the shape of the histograms confirmed that the concave_points variables (mean, se, worst) do not follow a normal distribution, so we applied a non-parametric test: the unpaired two-samples Wilcoxon test (also known as the Mann-Whitney test).

wilcox.tests <- t(as.data.frame(lapply(df[,c("concave_points_mean", "concave_points_se", "concave_points_worst")], function(x) wilcox.test(x ~ df$diagnosis, conf.level = 0.99)$p.value)))
colnames(wilcox.tests) <- "p-value"
as.data.frame(wilcox.tests)
##                           p-value
## concave_points_mean  1.006324e-76
## concave_points_se    2.370852e-31
## concave_points_worst 1.863997e-77

Wilcoxon test results: The p-values are < 0.01. Hence, we reject the null hypothesis. There are significant differences for all concave_points variables (mean, se, worst) between the groups.

The malignant breast cancer group shows higher values of the concave_points feature (number of concave portions of the contour) than the benign group.

cor.test(df$concave_points_mean, df$concave_points_worst)
## 
##  Pearson's product-moment correlation
## 
## data:  df$concave_points_mean and df$concave_points_worst
## t = 52.315, df = 567, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8949081 0.9232799
## sample estimates:
##       cor 
## 0.9101553
ggplot(df, aes(concave_points_mean, concave_points_worst)) +
  geom_point(aes(color = diagnosis), size = 1, alpha = 0.4) +
  scale_color_manual(values = c("#f69400", "#838383")) +
  scale_fill_manual(values = c("#f69400", "#838383")) +
  facet_wrap(~diagnosis) +
  stat_smooth( aes(color = diagnosis, fill = diagnosis), method = "lm") +
  stat_cor(aes(color = diagnosis), label.y = 4.4) +
  stat_poly_eq(
    aes(color = diagnosis, label = ..eq.label..),
    formula = y ~ x, label.y = 4.2, parse = TRUE) +
  theme_bw() +
  ggtitle("Correlation of concave_points variables") +
  theme(plot.title = element_text(hjust = 0.5))

Correlation analysis: The analysis showed a positive, very strong (r = 0.9102) and statistically significant (p-value < 2.2e-16) correlation between the concave_points_mean and concave_points_worst variables.

A point-biserial correlation, which measures the strength and direction of the association between a continuous variable and a binary variable, was carried out to assess the correlation between the concave_points feature and the diagnosis (benign or malignant).

b1 <- biserial.cor(df$concave_points_mean, df$diagnosis, level = 2) # Level 2 = the malignant breast cancer group
cat("Correlation value (r): ", b1, "strong")
## Correlation value (r):  0.7766138 strong
b2 <- biserial.cor(df$concave_points_se, df$diagnosis, level = 2)
cat("Correlation value (r): ", b2, "moderate")
## Correlation value (r):  0.4080423 moderate
b3 <- biserial.cor(df$concave_points_worst, df$diagnosis, level = 2)
cat("Correlation value (r): ", b3, "strong")
## Correlation value (r):  0.793566 strong

Identifying extreme values: A commonly used rule (Tukey’s rule) flags as outliers (extreme values, in this case) the observations that lie more than 1.5 times the interquartile range beyond the quartiles, i.e. below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. We quantified these outliers to better characterize the data distribution and to support the interpretation of the results, since extreme values can bias statistical inferences and predictive models.

out_1 <- which(df$concave_points_mean %in% boxplot(df$concave_points_mean, plot=FALSE)$out)
n.out_1 <- length(out_1)
cat("Number of Extreme Values:", n.out_1)
## Number of Extreme Values: 10
df[as.numeric(out_1),c("id", "diagnosis", "concave_points_mean")]
##            id diagnosis concave_points_mean
## 79    8610862         M              0.1604
## 83    8611555         M              0.1845
## 109     86355         M              0.1823
## 123    865423         M              0.2012
## 181    873592         M              0.1878
## 203    878796         M              0.1620
## 213   8810703         M              0.1595
## 353    899987         M              0.1913
## 394    903516         M              0.1562
## 462 911296202         M              0.1689
out_2 <- which(df$concave_points_se %in% boxplot(df$concave_points_se, plot=FALSE)$out)
n.out_2 <- length(out_2)
cat("Number of Extreme Values:", n.out_2)
## Number of Extreme Values: 19
df[as.numeric(out_2),c("id", "diagnosis", "concave_points_se")]
##            id diagnosis concave_points_se
## 13     846226         M           0.04090
## 43     855625         M           0.02638
## 69     859471         B           0.03322
## 79    8610862         M           0.02593
## 139    868826         M           0.02801
## 153   8710441         B           0.05279
## 162   8711803         M           0.02794
## 211 881046502         M           0.02765
## 214 881094802         M           0.03927
## 259    887181         M           0.03024
## 289   8913049         B           0.03487
## 291  89143602         B           0.02771
## 367   9011494         M           0.02536
## 377    901315         B           0.02919
## 390     90312         M           0.03441
## 462 911296202         M           0.02598
## 486    913063         B           0.02721
## 529    918192         B           0.02853
## 564    926125         M           0.02624
out_3 <- which(df$concave_points_worst %in% boxplot(df$concave_points_worst, plot=FALSE)$out)
n.out_3 <- length(out_3)
cat("Number of Extreme Values:", n.out_3)
## Number of Extreme Values: 0
df[as.numeric(out_3),c("id", "diagnosis", "concave_points_worst")]
## [1] id                   diagnosis            concave_points_worst
## <0 rows> (or 0-length row.names)

9. symmetry

symmetry <- df %>%
 dplyr::select(c(diagnosis, symmetry_mean, symmetry_se, symmetry_worst)) %>%
 group_by(diagnosis) %>%
 summarise(Mean_symmetry_mean = mean(symmetry_mean), Mean_symmetry_se = mean(symmetry_se), Mean_symmetry_worst = mean(symmetry_worst))

formattable(symmetry, list(
 diagnosis = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
 Mean_symmetry_mean = color_tile("#f7d383", "#fec306"),
 Mean_symmetry_se = color_tile("#eb724d", "#df5227"),
 Mean_symmetry_worst = color_tile("#b8ddf2", "#56B4E9")))
diagnosis  Mean_symmetry_mean  Mean_symmetry_se  Mean_symmetry_worst
B          0.174186            0.02058381        0.2702459
M          0.192909            0.02047240        0.3234679

The means of symmetry_mean and symmetry_worst are higher in the malignant breast cancer group than in the benign group, whereas the mean of symmetry_se is essentially the same in both groups.

test.m <- melt(df,id.vars='diagnosis', measure.vars=c('symmetry_mean','symmetry_se','symmetry_worst'))

ggplot(test.m, aes(x=diagnosis, y=value, fill=variable)) +
  geom_boxplot(alpha = 2/3) +
  labs(x = 'diagnosis') +
  scale_fill_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
  scale_color_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
  theme_bw() + ggtitle("diagnosis x symmetry variables") +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(~variable) +
  geom_jitter(alpha = I(1/4), aes(color = variable)) +
  stat_summary(fun.y=mean, geom="text", size=3, vjust=-3, aes( label=round(..y.., digits=2)))

Higher variability/spread for symmetry variables (mean, se, worst) was observed in the malignant breast cancer group.

ggplot(test.m, aes(x=value)) +
  geom_histogram(binwidth=0.04, aes(y=..density..), position="identity", alpha=0.7, color="black") +
  geom_density(alpha=0.4, color = NA) +
  labs(x = "", y = "Count", title = 'Distribution of the symmetry variables') + theme_bw() +
  aes(fill = variable) +
  scale_fill_manual(values = c("#fec306", "#df5227", "#56B4E9")) +  
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(~variable) +
  ylim(0, 0.5)

shapiro.tests <- t(as.data.frame(lapply(df[,c("symmetry_mean", "symmetry_se", "symmetry_worst")], function(x) shapiro.test(x)$p.value)))
colnames(shapiro.tests) <- "p-value"
as.data.frame(shapiro.tests)
##                     p-value
## symmetry_mean  7.884773e-09
## symmetry_se    3.126807e-24
## symmetry_worst 3.233785e-17

Normal distribution verification: The Shapiro-Wilk test and the shape of the histograms confirmed that the symmetry variables (mean, se, worst) do not follow a normal distribution, so we applied a non-parametric test: the unpaired two-samples Wilcoxon test (also known as the Mann-Whitney test).

wilcox.tests <- t(as.data.frame(lapply(df[,c("symmetry_mean", "symmetry_se", "symmetry_worst")], function(x) wilcox.test(x ~ df$diagnosis, conf.level = 0.99)$p.value)))
colnames(wilcox.tests) <- "p-value"
as.data.frame(wilcox.tests)
##                     p-value
## symmetry_mean  2.268050e-15
## symmetry_se    2.783664e-02
## symmetry_worst 3.151237e-21

Wilcoxon test results: The p-values are < 0.01 for 2 of the 3 symmetry variables. Hence, we reject the null hypothesis for symmetry_mean and symmetry_worst: there are significant differences between the groups for these variables. For symmetry_se (p ≈ 0.028), the difference is not significant at the 0.01 level.

The malignant breast cancer group shows higher symmetry values (mean and worst) than the benign group.

cor.test(df$symmetry_mean, df$symmetry_worst)
## 
##  Pearson's product-moment correlation
## 
## data:  df$symmetry_mean and df$symmetry_worst
## t = 23.329, df = 567, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6553251 0.7394852
## sample estimates:
##       cor 
## 0.6998258
ggplot(df, aes(symmetry_mean, symmetry_worst)) +
  geom_point(aes(color = diagnosis), size = 1, alpha = 0.4) +
  scale_color_manual(values = c("#f69400", "#838383")) +
  scale_fill_manual(values = c("#f69400", "#838383")) +
  facet_wrap(~diagnosis) +
  stat_smooth( aes(color = diagnosis, fill = diagnosis), method = "lm") +
  stat_cor(aes(color = diagnosis), label.y = 4.4) +
  stat_poly_eq(
    aes(color = diagnosis, label = ..eq.label..),
    formula = y ~ x, label.y = 4.2, parse = TRUE) +
  theme_bw() +
  ggtitle("Correlation of symmetry variables") +
  theme(plot.title = element_text(hjust = 0.5))

Correlation analysis: The analysis showed a positive, moderate (r = 0.6998) and statistically significant (p-value < 2.2e-16) correlation between the symmetry_mean and symmetry_worst variables.

A point-biserial correlation, which measures the strength and direction of the association between a continuous variable and a binary variable, was carried out to assess the correlation between the symmetry feature and the diagnosis (benign or malignant).

b1 <- biserial.cor(df$symmetry_mean, df$diagnosis, level = 2) # Level 2 = the malignant breast cancer group
cat("Correlation value (r): ", b1, "weak")
## Correlation value (r):  0.3304986 weak
b2 <- biserial.cor(df$symmetry_se, df$diagnosis, level = 2)
cat("Correlation value (r): ", b2, "very weak")
## Correlation value (r):  -0.006521756 very weak
b3 <- biserial.cor(df$symmetry_worst, df$diagnosis, level = 2)
cat("Correlation value (r): ", b3, "moderate")
## Correlation value (r):  0.4162943 moderate

Identifying extreme values: A commonly used rule (Tukey’s rule) flags as outliers (extreme values, in this case) the observations that lie more than 1.5 times the interquartile range beyond the quartiles, i.e. below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. We quantified these outliers to better characterize the data distribution and to support the interpretation of the results, since extreme values can bias statistical inferences and predictive models.

out_1 <- which(df$symmetry_mean %in% boxplot(df$symmetry_mean, plot=FALSE)$out)
n.out_1 <- length(out_1)
cat("Number of Extreme Values:", n.out_1)
## Number of Extreme Values: 15
df[as.numeric(out_1),c("id", "diagnosis", "symmetry_mean")]
##            id diagnosis symmetry_mean
## 4    84348301         M        0.2597
## 23    8511133         M        0.2521
## 26     852631         M        0.3040
## 61     858970         B        0.2743
## 79    8610862         M        0.2906
## 109     86355         M        0.2556
## 123    865423         M        0.2655
## 147    869691         M        0.2678
## 151 871001501         B        0.2540
## 153   8710441         B        0.2548
## 259    887181         M        0.2495
## 289   8913049         B        0.2595
## 324    895100         M        0.2569
## 425    907145         B        0.2538
## 562    925311         B        0.1060
out_2 <- which(df$symmetry_se %in% boxplot(df$symmetry_se, plot=FALSE)$out)
n.out_2 <- length(out_2)
cat("Number of Extreme Values:", n.out_2)
## Number of Extreme Values: 27
df[as.numeric(out_2),c("id", "diagnosis", "symmetry_se")]
##           id diagnosis symmetry_se
## 4   84348301         M     0.05963
## 13    846226         M     0.04484
## 23   8511133         M     0.03672
## 43    855625         M     0.05333
## 61    858970         B     0.04183
## 64    859196         B     0.04192
## 69    859471         B     0.04197
## 79   8610862         M     0.07895
## 120   865128         M     0.05014
## 123   865423         M     0.04547
## 139   868826         M     0.05168
## 147   869691         M     0.05628
## 177   872608         B     0.03880
## 191   874858         M     0.05113
## 193   875099         B     0.03799
## 213  8810703         M     0.04783
## 215  8810955         M     0.04499
## 291 89143602         B     0.04077
## 315   894047         B     0.06146
## 330   895633         M     0.04022
## 333   897132         B     0.04243
## 344   898431         M     0.03756
## 346   898677         B     0.03675
## 352   899667         M     0.05543
## 367  9011494         M     0.03710
## 521   917092         B     0.03997
## 554   924342         B     0.03759
out_3 <- which(df$symmetry_worst %in% boxplot(df$symmetry_worst, plot=FALSE)$out)
n.out_3 <- length(out_3)
cat("Number of Extreme Values:", n.out_3)
## Number of Extreme Values: 23
df[as.numeric(out_3),c("id", "diagnosis", "symmetry_worst")]
##           id diagnosis symmetry_worst
## 1     842302         M         0.4601
## 4   84348301         M         0.6638
## 9     844981         M         0.4378
## 10  84501001         M         0.4366
## 16  84799002         M         0.4218
## 23   8511133         M         0.4667
## 27    852763         M         0.4264
## 32    853612         M         0.4761
## 35    854039         M         0.4270
## 36    854253         M         0.4863
## 43    855625         M         0.4670
## 69    859471         B         0.4228
## 79   8610862         M         0.5440
## 120   865128         M         0.4882
## 147   869691         M         0.5774
## 191   874858         M         0.5166
## 200   877500         M         0.4753
## 204    87880         M         0.4432
## 215  8810955         M         0.4724
## 324   895100         M         0.5558
## 352   899667         M         0.4245
## 371  9012315         M         0.4824
## 490   913535         M         0.4677

10. fractal_dimension

fractal_dimension <- df %>%
 dplyr::select(c(diagnosis, fractal_dimension_mean, fractal_dimension_se, fractal_dimension_worst)) %>%
 group_by(diagnosis) %>%
 summarise(Mean_fractal_dimension_mean = mean(fractal_dimension_mean), Mean_fractal_dimension_se = mean(fractal_dimension_se), Mean_fractal_dimension_worst = mean(fractal_dimension_worst))

formattable(fractal_dimension, list(
 diagnosis = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
 Mean_fractal_dimension_mean = color_tile("#f7d383", "#fec306"),
 Mean_fractal_dimension_se = color_tile("#eb724d", "#df5227"),
 Mean_fractal_dimension_worst = color_tile("#b8ddf2", "#56B4E9")))
diagnosis  Mean_fractal_dimension_mean  Mean_fractal_dimension_se  Mean_fractal_dimension_worst
B          0.06286739                   0.003636051                0.07944207
M          0.06268009                   0.004062406                0.09152995

The means of fractal_dimension_se and fractal_dimension_worst are higher in the malignant breast cancer group than in the benign group, while the mean of fractal_dimension_mean is similar in both groups.

test.m <- melt(df,id.vars='diagnosis', measure.vars=c('fractal_dimension_mean','fractal_dimension_se','fractal_dimension_worst'))

ggplot(test.m, aes(x=diagnosis, y=value, fill=variable)) +
  geom_boxplot(alpha = 2/3) +
  labs(x = 'diagnosis') +
  scale_fill_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
  scale_color_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
  theme_bw() + ggtitle("diagnosis x fractal_dimension variables") +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(~variable) +
  geom_jitter(alpha = I(1/4), aes(color = variable)) +
  stat_summary(fun.y=mean, geom="text", size=3, vjust=-3, aes( label=round(..y.., digits=2)))

Higher variability/spread for fractal_dimension variables (mean, se, worst) was observed in the malignant breast cancer group.

ggplot(test.m, aes(x=value)) +
  geom_histogram(binwidth=0.02, aes(y=..density..), position="identity", alpha=0.7, color="black") +
  geom_density(alpha=0.4, color = NA) +
  labs(x = "", y = "Count", title = 'Distribution of the fractal_dimension variables') + theme_bw() +
  aes(fill = variable) +
  scale_fill_manual(values = c("#fec306", "#df5227", "#56B4E9")) +  
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(~variable) +
  ylim(0, 0.5)

shapiro.tests <- t(as.data.frame(lapply(df[,c("fractal_dimension_mean", "fractal_dimension_se", "fractal_dimension_worst")], function(x) shapiro.test(x)$p.value)))
colnames(shapiro.tests) <- "p-value"
as.data.frame(shapiro.tests)
##                              p-value
## fractal_dimension_mean  1.956575e-16
## fractal_dimension_se    8.551018e-31
## fractal_dimension_worst 9.195146e-20

Normal distribution verification: The Shapiro-Wilk test and the shape of the histograms confirmed that the fractal_dimension variables (mean, se, worst) do not follow a normal distribution, so we applied a non-parametric test: the unpaired two-samples Wilcoxon test (also known as the Mann-Whitney test).

wilcox.tests <- t(as.data.frame(lapply(df[,c("fractal_dimension_mean", "fractal_dimension_se", "fractal_dimension_worst")], function(x) wilcox.test(x ~ df$diagnosis, conf.level = 0.99)$p.value)))
colnames(wilcox.tests) <- "p-value"
as.data.frame(wilcox.tests)
##                              p-value
## fractal_dimension_mean  5.371856e-01
## fractal_dimension_se    1.572165e-06
## fractal_dimension_worst 1.144240e-13

Wilcoxon test results: The p-values are < 0.01 for 2 of the 3 fractal_dimension variables. Hence, we reject the null hypothesis for fractal_dimension_se and fractal_dimension_worst: there are significant differences between the groups for these variables. For fractal_dimension_mean (p ≈ 0.54), the difference is not significant.

Among the fractal_dimension variables (“coastline approximation” − 1), only fractal_dimension_se and fractal_dimension_worst are higher in the malignant breast cancer group than in the benign group.

cor.test(df$fractal_dimension_mean, df$fractal_dimension_worst)
## 
##  Pearson's product-moment correlation
## 
## data:  df$fractal_dimension_mean and df$fractal_dimension_worst
## t = 28.49, df = 567, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7312170 0.7990954
## sample estimates:
##       cor 
## 0.7672968
ggplot(df, aes(fractal_dimension_mean, fractal_dimension_worst)) +
  geom_point(aes(color = diagnosis), size = 1, alpha = 0.4) +
  scale_color_manual(values = c("#f69400", "#838383")) +
  scale_fill_manual(values = c("#f69400", "#838383")) +
  facet_wrap(~diagnosis) +
  stat_smooth( aes(color = diagnosis, fill = diagnosis), method = "lm") +
  stat_cor(aes(color = diagnosis), label.y = 4.4) +
  stat_poly_eq(
    aes(color = diagnosis, label = ..eq.label..),
    formula = y ~ x, label.y = 4.2, parse = TRUE) +
  theme_bw() +
  ggtitle("Correlation of fractal_dimension variables") +
  theme(plot.title = element_text(hjust = 0.5))

Correlation analysis: The analysis showed a positive, strong (r = 0.7673) and statistically significant (p-value < 2.2e-16) correlation between the fractal_dimension_mean and fractal_dimension_worst variables.

A point-biserial correlation, which measures the strength and direction of the association between a continuous variable and a binary variable, was carried out to assess the correlation between the fractal_dimension feature and the diagnosis (benign or malignant).

b1 <- biserial.cor(df$fractal_dimension_mean, df$diagnosis, level = 2) # Level 2 = the malignant breast cancer group
cat("Correlation value (r): ", b1, "very weak")
## Correlation value (r):  -0.0128376 very weak
b2 <- biserial.cor(df$fractal_dimension_se, df$diagnosis, level = 2)
cat("Correlation value (r): ", b2, "very weak")
## Correlation value (r):  0.07797242 very weak
b3 <- biserial.cor(df$fractal_dimension_worst, df$diagnosis, level = 2)
cat("Correlation value (r): ", b3, "moderate")
## Correlation value (r):  0.3238722 moderate
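For intuition, the point-biserial coefficient reduces to Pearson's r computed against a 0/1 coding of the group variable. A minimal sketch on synthetic data (independent of the dataset above):

```r
# Point-biserial r is Pearson's r with the binary variable coded 0/1
# (ltm::biserial.cor returns the same value, up to the sign convention
# selected via its `level` argument).
set.seed(1)
group <- rep(c(0, 1), each = 50)     # 0 = one group, 1 = the other
value <- rnorm(100) + 0.5 * group    # group 1 shifted upward
cor(value, group)                    # the point-biserial correlation
```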

Identifying extreme values: A commonly used rule (Tukey’s rule) flags as outliers (extreme values, in this case) any observations more than 1.5 times the interquartile range (IQR) beyond the quartiles, i.e., below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. We quantified these outliers to better characterize the data distribution and improve the interpretation of the results, since extreme values could bias the statistical inferences and the prediction models.
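The rule can be made explicit with a self-contained sketch on a toy vector (synthetic data, not the dataset itself):

```r
# Tukey's rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged.
set.seed(1)
x <- c(rnorm(100), 5, -4)                 # two planted extreme values
q <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]
out_manual <- which(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
out_manual                                # indices of the flagged values
```

Note that `boxplot(x, plot = FALSE)$out`, used below, applies the same rule but computes the quartiles as hinges, which can differ slightly from `quantile()` on small samples.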

out_1 <- which(df$fractal_dimension_mean %in% boxplot(df$fractal_dimension_mean, plot=FALSE)$out)
n.out_1 <- length(out_1)
cat("Number of Extreme Values:", n.out_1)
## Number of Extreme Values: 15
df[as.numeric(out_1),c("id", "diagnosis", "fractal_dimension_mean")]
##            id diagnosis fractal_dimension_mean
## 4    84348301         M                0.09744
## 10   84501001         M                0.08243
## 69     859471         B                0.08046
## 72     859711         B                0.08980
## 79    8610862         M                0.08142
## 152 871001502         B                0.08261
## 153   8710441         B                0.09296
## 177    872608         B                0.08116
## 259    887181         M                0.08104
## 319    894329         B                0.08743
## 377    901315         B                0.08450
## 380   9013838         M                0.07950
## 505    915186         B                0.09502
## 506    915276         B                0.09575
## 508  91544002         B                0.07976
out_2 <- which(df$fractal_dimension_se %in% boxplot(df$fractal_dimension_se, plot=FALSE)$out)
n.out_2 <- length(out_2)
cat("Number of Extreme Values:", n.out_2)
## Number of Extreme Values: 28
df[as.numeric(out_2),c("id", "diagnosis", "fractal_dimension_se")]
##            id diagnosis fractal_dimension_se
## 4    84348301         M             0.009208
## 10   84501001         M             0.010080
## 13     846226         M             0.012840
## 15   84667401         M             0.008093
## 69     859471         B             0.009559
## 72     859711         B             0.021930
## 84    8611792         M             0.010390
## 113     86409         B             0.012980
## 123    865423         M             0.009875
## 146    869476         B             0.009423
## 148  86973701         B             0.009368
## 152 871001502         B             0.011780
## 153   8710441         B             0.029840
## 177    872608         B             0.017920
## 191    874858         M             0.011720
## 214 881094802         M             0.012560
## 243    883852         B             0.008675
## 258    886776         M             0.008660
## 291  89143602         B             0.022860
## 377    901315         B             0.012200
## 389    903011         B             0.012330
## 451   9111596         B             0.008925
## 466   9113239         B             0.008133
## 469   9113538         M             0.011300
## 486    913063         B             0.009627
## 505    915186         B             0.010450
## 506    915276         B             0.011480
## 508  91544002         B             0.008313
out_3 <- which(df$fractal_dimension_worst %in% boxplot(df$fractal_dimension_worst, plot=FALSE)$out)
n.out_3 <- length(out_3)
cat("Number of Extreme Values:", n.out_3)
## Number of Extreme Values: 24
df[as.numeric(out_3),c("id", "diagnosis", "fractal_dimension_worst")]
##            id diagnosis fractal_dimension_worst
## 4    84348301         M                  0.1730
## 6      843786         M                  0.1244
## 10   84501001         M                  0.2075
## 15   84667401         M                  0.1431
## 16   84799002         M                  0.1341
## 27     852763         M                  0.1275
## 32     853612         M                  0.1402
## 35     854039         M                  0.1233
## 73     859717         M                  0.1339
## 106    863030         M                  0.1405
## 119    864877         M                  0.1252
## 152 871001502         B                  0.1486
## 153   8710441         B                  0.1259
## 182    873593         M                  0.1284
## 191    874858         M                  0.1446
## 230    881861         M                  0.1243
## 243    883852         B                  0.1297
## 253    885429         M                  0.1297
## 380   9013838         M                  0.1403
## 466   9113239         B                  0.1249
## 505    915186         B                  0.1252
## 506    915276         B                  0.1364
## 563    925622         M                  0.1409
## 568    927241         M                  0.1240


Correlation analysis for all features (30):

df.n <- subset(df, select = -c(id, diagnosis))
corrplot(cor(df.n), type="lower", number.cex = .35, addCoef.col = "black", tl.col = "black", tl.srt = 90, tl.cex = .5, col=brewer.pal(n=8, name="RdBu"), order = "FPC")

The strongest correlation values (0.80 - 0.999) are shown below:

cor.sig <- as.data.frame(as.table(cor(df.n)))
cor.sig <- subset(cor.sig, c(abs(Freq) > 0.8 & abs(Freq) != 1))
cor.sig %<>% distinct(Freq, .keep_all = TRUE)
colnames(cor.sig) <- c("Variables_1", "Variables_2", "Correlation Value")
cor.sig[order(-cor.sig$'Correlation Value'),] 
##                Variables_1         Variables_2 Correlation Value
## 1           perimeter_mean         radius_mean         0.9978553
## 37         perimeter_worst        radius_worst         0.9937079
## 2                area_mean         radius_mean         0.9873572
## 8                area_mean      perimeter_mean         0.9865068
## 38              area_worst        radius_worst         0.9840146
## 39              area_worst     perimeter_worst         0.9775781
## 31            perimeter_se           radius_se         0.9727937
## 11         perimeter_worst      perimeter_mean         0.9703869
## 4             radius_worst         radius_mean         0.9695390
## 10            radius_worst      perimeter_mean         0.9694764
## 5          perimeter_worst         radius_mean         0.9651365
## 15            radius_worst           area_mean         0.9627461
## 17              area_worst           area_mean         0.9592133
## 16         perimeter_worst           area_mean         0.9591196
## 32                 area_se           radius_se         0.9518301
## 12              area_worst      perimeter_mean         0.9415498
## 6               area_worst         radius_mean         0.9410825
## 33                 area_se        perimeter_se         0.9376554
## 24     concave_points_mean      concavity_mean         0.9213910
## 7            texture_worst        texture_mean         0.9120446
## 30    concave_points_worst concave_points_mean         0.9101553
## 41         concavity_worst   compactness_worst         0.8922609
## 25         concavity_worst      concavity_mean         0.8841026
## 19          concavity_mean    compactness_mean         0.8831207
## 21       compactness_worst    compactness_mean         0.8658090
## 26    concave_points_worst      concavity_mean         0.8613230
## 28         perimeter_worst concave_points_mean         0.8559231
## 44    concave_points_worst     concavity_worst         0.8554339
## 9      concave_points_mean      perimeter_mean         0.8509770
## 20     concave_points_mean    compactness_mean         0.8311350
## 27            radius_worst concave_points_mean         0.8303176
## 13     concave_points_mean           area_mean         0.8232689
## 3      concave_points_mean         radius_mean         0.8225285
## 40    concave_points_worst     perimeter_worst         0.8163221
## 22         concavity_worst    compactness_mean         0.8162752
## 23    concave_points_worst    compactness_mean         0.8155732
## 34              area_worst             area_se         0.8114080
## 43 fractal_dimension_worst   compactness_worst         0.8104549
## 29              area_worst concave_points_mean         0.8096296
## 18        smoothness_worst     smoothness_mean         0.8053242
## 36    fractal_dimension_se      compactness_se         0.8032688
## 35            concavity_se      compactness_se         0.8012683
## 42    concave_points_worst   compactness_worst         0.8010804
## 14                 area_se           area_mean         0.8000859

EDA Results:

  1. Most features have higher means of their variables (mean, se, worst) in the malignant breast cancer group than in the benign breast cancer group, except:
  • texture_se (very similar in both groups, p > 0.01)
  • smoothness_se (very similar in both groups, p > 0.01)
  • symmetry_se (very similar in both groups, p > 0.01)
  • fractal_dimension_mean (very similar in both groups, p > 0.01)
  2. All the higher means of the variables (mean, se, worst) in the malignant breast cancer group showed p < 0.01 (statistical significance).

  3. The variables which showed the most statistically significant differences between the malignant and benign breast cancer groups were:

  • perimeter_worst 2.58E-80
  • radius_worst 1.14E-78
  • area_worst 1.80E-78
  • concave_points_worst 1.86E-77
  • concave_points_mean 1.01E-76
  • perimeter_mean 3.55E-71
  • area_mean 1.54E-68
  • concavity_mean 2.16E-68
  • radius_mean 2.69E-68
  • area_se 5.77E-65
  • concavity_worst 1.76E-63
  • perimeter_se 5.10E-51
  • radius_se 6.22E-49
  4. Across the 435 pairwise correlations among all 30 features, 44 (10.1%) showed a very strong correlation value (|r| > 0.8).

  5. Through the point-biserial correlation analysis, the following variables showed a strong correlation with the diagnosis variable (malignant or benign):

  • radius_mean: 0.73
  • radius_worst: 0.77
  • perimeter_mean: 0.74
  • perimeter_worst: 0.78
  • area_mean: 0.70
  • area_worst: 0.73
  • concavity_mean: 0.69
  • concavity_worst: 0.65
  • concave_points_mean: 0.77
  • concave_points_worst: 0.79

4. Principal Component Analysis (PCA)

Principal component analysis (PCA) is a data-reduction technique that transforms a larger number of correlated variables into a smaller set of uncorrelated variables called principal components (PCs) or dimensions. We think the PCA method could improve the analysis of this dataset, which has 30 highly correlated variables.

PCA is a great pre-processing tool for picking out the most relevant linear combination of variables and using them in prediction models.

The main drawback of PCA is that it generates the principal components in an unsupervised manner, without looking at the target vector. In addition, the predictors generally become harder to interpret, since each principal component is a combination of the original features.
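As a cross-check of the method, base R's `prcomp` on standardized variables yields the same eigenvalue decomposition that `FactoMineR::PCA` reports. A sketch on synthetic data (with the real `df.v`, the eigenvalues match the `summary(df.pca)` output below):

```r
# PCA on standardized variables: the eigenvalues of the correlation matrix
# sum to the number of variables, so "% of variance" has a natural scale.
set.seed(1)
m <- matrix(rnorm(200), ncol = 4)          # toy data: 50 observations, 4 variables
p <- prcomp(m, center = TRUE, scale. = TRUE)
eig <- p$sdev^2                            # eigenvalues
pct <- 100 * eig / sum(eig)                # % of variance per component
cumsum(pct)                                # cumulative % of variance
```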

df <- subset(df, select = -id)
df.v <- subset(df, select = -diagnosis)
df.d <- subset(df, select = diagnosis)

# Apply PCA
df.pca <- PCA(df.v, scale.unit = TRUE, graph = FALSE)
summary(df.pca)
## 
## Call:
## PCA(X = df.v, scale.unit = TRUE, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance              13.282   5.691   2.818   1.981   1.649   1.207
## % of var.             44.272  18.971   9.393   6.602   5.496   4.025
## Cumulative % of var.  44.272  63.243  72.636  79.239  84.734  88.759
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.675   0.477   0.417   0.351   0.294   0.261
## % of var.              2.251   1.589   1.390   1.169   0.980   0.871
## Cumulative % of var.  91.010  92.598  93.988  95.157  96.137  97.007
##                       Dim.13  Dim.14  Dim.15  Dim.16  Dim.17  Dim.18
## Variance               0.241   0.157   0.094   0.080   0.059   0.053
## % of var.              0.805   0.523   0.314   0.266   0.198   0.175
## Cumulative % of var.  97.812  98.335  98.649  98.915  99.113  99.288
##                       Dim.19  Dim.20  Dim.21  Dim.22  Dim.23  Dim.24
## Variance               0.049   0.031   0.030   0.027   0.024   0.018
## % of var.              0.165   0.104   0.100   0.091   0.081   0.060
## Cumulative % of var.  99.453  99.557  99.657  99.749  99.830  99.890
##                       Dim.25  Dim.26  Dim.27  Dim.28  Dim.29  Dim.30
## Variance               0.015   0.008   0.007   0.002   0.001   0.000
## % of var.              0.052   0.027   0.023   0.005   0.002   0.000
## Cumulative % of var.  99.942  99.969  99.992  99.997 100.000 100.000
## 
## Individuals (the 10 first)
##                             Dist    Dim.1    ctr   cos2    Dim.2    ctr
## 1                       | 10.710 |  9.193  1.118  0.737 |  1.949  0.117
## 2                       |  5.132 |  2.388  0.075  0.216 | -3.768  0.438
## 3                       |  6.119 |  5.734  0.435  0.878 | -1.075  0.036
## 4                       | 13.986 |  7.123  0.671  0.259 | 10.276  3.261
## 5                       |  5.868 |  3.935  0.205  0.450 | -1.948  0.117
## 6                       |  5.735 |  2.380  0.075  0.172 |  3.950  0.482
## 7                       |  3.970 |  2.239  0.066  0.318 | -2.690  0.223
## 8                       |  4.195 |  2.143  0.061  0.261 |  2.340  0.169
## 9                       |  6.017 |  3.175  0.133  0.278 |  3.392  0.355
## 10                      | 12.163 |  6.352  0.534  0.273 |  7.727  1.844
##                           cos2    Dim.3    ctr   cos2  
## 1                        0.033 | -1.123  0.079  0.011 |
## 2                        0.539 | -0.529  0.017  0.011 |
## 3                        0.031 | -0.552  0.019  0.008 |
## 4                        0.540 | -3.233  0.652  0.053 |
## 5                        0.110 |  1.390  0.120  0.056 |
## 6                        0.474 | -2.935  0.537  0.262 |
## 7                        0.459 | -1.640  0.168  0.171 |
## 8                        0.311 | -0.872  0.047  0.043 |
## 9                        0.318 | -3.120  0.607  0.269 |
## 10                       0.404 | -4.342  1.176  0.127 |
## 
## Variables (the 10 first)
##                            Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## radius_mean             |  0.798  4.792  0.636 | -0.558  5.469  0.311 |
## texture_mean            |  0.378  1.076  0.143 | -0.142  0.356  0.020 |
## perimeter_mean          |  0.829  5.177  0.688 | -0.513  4.630  0.264 |
## area_mean               |  0.805  4.884  0.649 | -0.551  5.340  0.304 |
## smoothness_mean         |  0.520  2.033  0.270 |  0.444  3.464  0.197 |
## compactness_mean        |  0.872  5.726  0.760 |  0.362  2.307  0.131 |
## concavity_mean          |  0.942  6.677  0.887 |  0.144  0.362  0.021 |
## concave_points_mean     |  0.951  6.804  0.904 | -0.083  0.121  0.007 |
## symmetry_mean           |  0.504  1.909  0.254 |  0.454  3.623  0.206 |
## fractal_dimension_mean  |  0.235  0.414  0.055 |  0.875 13.438  0.765 |
##                          Dim.3    ctr   cos2  
## radius_mean             -0.014  0.007  0.000 |
## texture_mean             0.108  0.417  0.012 |
## perimeter_mean          -0.016  0.009  0.000 |
## area_mean                0.048  0.082  0.002 |
## smoothness_mean         -0.175  1.088  0.031 |
## compactness_mean        -0.124  0.549  0.015 |
## concavity_mean           0.005  0.001  0.000 |
## concave_points_mean     -0.043  0.065  0.002 |
## symmetry_mean           -0.068  0.162  0.005 |
## fractal_dimension_mean  -0.038  0.051  0.001 |
# Extract the eigenvalues of principal components
eig.val <- as.data.frame(get_eigenvalue(df.pca))
subset(eig.val, eigenvalue > 1) # The Kaiser–Harris criterion
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  13.281608        44.272026                    44.27203
## Dim.2   5.691355        18.971182                    63.24321
## Dim.3   2.817949         9.393163                    72.63637
## Dim.4   1.980640         6.602135                    79.23851
## Dim.5   1.648731         5.495768                    84.73427
## Dim.6   1.207357         4.024522                    88.75880

The Kaiser–Harris criterion suggests retaining components with eigenvalues greater than 1 (the cutoff point). The last component above this cutoff has an eigenvalue of 1.207, so we stopped at the sixth principal component.

fviz_eig(df.pca, addlabels=TRUE, hjust = 0, barfill = "#4189b3", ncp=6) + ylim(0, 50)

In our analysis, the first six principal components explain 88.76% of the dataset's variance. The first dimension is associated with the largest eigenvalue, the second with the second-largest, and so on.

4.1. Principal Components Exploration:

1. Quality of variables

head(get_pca_var(df.pca)$cos2)
##                      Dim.1      Dim.2        Dim.3       Dim.4
## radius_mean      0.6364318 0.31125539 0.0002050963 0.003396209
## texture_mean     0.1428940 0.02028864 0.0117415199 0.720298141
## perimeter_mean   0.6876316 0.26352690 0.0002444703 0.003491038
## area_mean        0.6486576 0.30389811 0.0023210397 0.005655066
## smoothness_mean  0.2700393 0.19713747 0.0306502709 0.050313944
## compactness_mean 0.7604714 0.13130559 0.0154693024 0.002002220
##                         Dim.5
## radius_mean      0.0023540715
## texture_mean     0.0040347193
## perimeter_mean   0.0023030547
## area_mean        0.0001759769
## smoothness_mean  0.2197586899
## compactness_mean 0.0002258480
par(mfrow=c(1,2))

corrplot(get_pca_var(df.pca)$cos2[1:16,], number.cex = .65, addCoef.col = "black", tl.col = "black", tl.cex = 0.75)

corrplot(get_pca_var(df.pca)$cos2[17:30,], number.cex = .65, addCoef.col = "black", tl.col = "black", tl.cex = 0.75)

For example, in the column labeled Dim.1 (the first PC), 63.64% of the variance in the radius_mean variable is accounted for by Dim.1, while 31.13% is accounted for by Dim.2 (the second PC).

2. Coordinates of variables

head(get_pca_var(df.pca)$coord)
##                      Dim.1      Dim.2       Dim.3       Dim.4       Dim.5
## radius_mean      0.7977668 -0.5579027 -0.01432118 -0.05827700 -0.04851878
## texture_mean     0.3780132 -0.1424382  0.10835829  0.84870380  0.06351944
## perimeter_mean   0.8292355 -0.5133487 -0.01563555 -0.05908501 -0.04799015
## area_mean        0.8053928 -0.5512695  0.04817717 -0.07520017 -0.01326563
## smoothness_mean  0.5196530  0.4440017 -0.17507219 -0.22430770  0.46878427
## compactness_mean 0.8720501  0.3623611 -0.12437565 -0.04474618 -0.01502824

The columns contain the component loadings, i.e., the correlations of the observed variables with the principal components (PCs). The radius_mean variable is strongly positively correlated (0.80) with the first principal component, and moderately negatively correlated (-0.56) with the second.
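In this convention, a variable's coordinate on a dimension equals the Pearson correlation between that variable and the component scores. A toy sketch with `prcomp` on synthetic standardized data:

```r
# For PCA on standardized variables, the loading of variable j on PC k
# equals cor(variable j, scores of PC k) = rotation[j, k] * sdev[k].
set.seed(1)
m <- scale(matrix(rnorm(150), ncol = 3))   # 50 observations, 3 standardized variables
p <- prcomp(m)
cor(m[, 1], p$x[, 1])                      # coordinate of variable 1 on Dim.1
```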

fviz_pca_var(df.pca,labelsize = 3, 
             col.var = "coord",
             gradient.cols = c("#56B4E9", "#fec306", "#df5227"),
             repel = TRUE
)

The area_mean, area_worst, radius_mean, radius_worst, perimeter_mean and perimeter_worst variables are positively correlated with one another, and these six metrics contribute the most to the construction of the first principal component (dimension 1).

The fractal_dimension_mean, fractal_dimension_se, fractal_dimension_worst and smoothness_se variables contribute the most to the 2nd component.

Thus, the 1st component mainly relates to quantitative geometric measures (area, radius and perimeter), while the 2nd mainly relates to appearance/aspect, or qualitative geometric, measures (fractal_dimension, smoothness).

3. Contributions of variables

head(get_pca_var(df.pca)$contrib)
##                     Dim.1     Dim.2       Dim.3      Dim.4       Dim.5
## radius_mean      4.791828 5.4689158 0.007278210  0.1714702  0.14278085
## texture_mean     1.075879 0.3564817 0.416669002 36.3669303  0.24471672
## perimeter_mean   5.177322 4.6303018 0.008675469  0.1762581  0.13968654
## area_mean        4.883878 5.3396446 0.082366279  0.2855170  0.01067348
## smoothness_mean  2.033182 3.4638057 1.087680124  2.5402866 13.32896332
## compactness_mean 5.725748 2.3071061 0.548956087  0.1010895  0.01369829
p1 <- fviz_contrib(df.pca, choice = "var", axes = 1, fill="#4189b3", top=15)
p2 <- fviz_contrib(df.pca, choice = "var", axes = 2, fill="#f69400", color="white", top=15)
grid.arrange(p1,p2,ncol=2)

The area_mean variable contributes 4.88% to the first principal component and 5.34% to the second. The texture_mean variable contributes 36.37% to the fourth component.

4. Results for individuals

fviz_pca_ind(df.pca,
             geom.ind = "point",
             col.var = "black",
             col.ind = df.d$diagnosis,
             palette = c("#f69400","#4189b3"),
             addEllipses = TRUE,
             legend.title = "Diagnosis",
             mean.point = FALSE, labelsize = 3, pointsize = 3, pointshape = 20)

The 1st principal component (dimension 1) indicates the principal axis of variability between groups (benign and malignant).


5. Modeling and Predictions

# Train and Test (Original Data)
set.seed(1234)
training.samples <- df$diagnosis %>% 
    createDataPartition(p = 0.8, list = FALSE)
df.train <- df[ training.samples,]
df.test  <- df[-training.samples,]

# Train and Test (PCA pre-processing)
df.pca2 <- PCA(df.v, scale.unit = TRUE, graph = FALSE, ncp = 6)
set.seed(1234)
df.pca.final <- cbind(df.d, df.pca2$ind$coord)
training.samples.pca <- df.pca.final$diagnosis %>% 
    createDataPartition(p = 0.8, list = FALSE)
df.train.pca <- df.pca.final[ training.samples.pca,]
df.test.pca  <- df.pca.final[-training.samples.pca,]
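An alternative worth noting (a sketch on the built-in iris data, assuming the caret package is available): caret's `preProcess` can fold the PCA step into the preprocessing pipeline, so the rotation is estimated from the training rows only and then applied to the test rows, avoiding information leakage from the test set.

```r
library(caret)

# Learn centering, scaling and the PCA rotation from the training rows only,
# then project the held-out rows with that same rotation.
set.seed(1234)
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
pp  <- preProcess(iris[idx, 1:4], method = c("center", "scale", "pca"), pcaComp = 2)
train.pcs <- predict(pp, iris[idx, 1:4])   # PC scores for training rows
test.pcs  <- predict(pp, iris[-idx, 1:4])  # test rows in the same PC space
```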

1. KNN

# Original Data
set.seed(1234)
model.knn <- train(
  diagnosis ~ ., data = df.train, method = "knn",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale"),
  tuneLength = 20
  )

plot(model.knn)

# Prediction Original Data
predicted.classes <- model.knn %>% predict(df.test)
matrix.knn <- confusionMatrix(predicted.classes, df.test$diagnosis)
matrix.knn
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 71  4
##          M  0 38
##                                           
##                Accuracy : 0.9646          
##                  95% CI : (0.9118, 0.9903)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9227          
##                                           
##  Mcnemar's Test P-Value : 0.1336          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9048          
##          Pos Pred Value : 0.9467          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6283          
##          Detection Rate : 0.6283          
##    Detection Prevalence : 0.6637          
##       Balanced Accuracy : 0.9524          
##                                           
##        'Positive' Class : B               
## 
# PCA
set.seed(1234)
model.knn.pca <- train(
  diagnosis ~ ., data = df.train.pca, method = "knn",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale"),
  tuneLength = 20
  )

plot(model.knn.pca)

# Prediction PCA
predicted.classes.pca <- model.knn.pca %>% predict(df.test.pca)
matrix.knn.pca <- confusionMatrix(predicted.classes.pca, df.test.pca$diagnosis)
matrix.knn.pca
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 71 10
##          M  0 32
##                                           
##                Accuracy : 0.9115          
##                  95% CI : (0.8433, 0.9567)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : 6.062e-12       
##                                           
##                   Kappa : 0.8008          
##                                           
##  Mcnemar's Test P-Value : 0.004427        
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.7619          
##          Pos Pred Value : 0.8765          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6283          
##          Detection Rate : 0.6283          
##    Detection Prevalence : 0.7168          
##       Balanced Accuracy : 0.8810          
##                                           
##        'Positive' Class : B               
## 

2. CART

# CART Model Original Data
set.seed(1234)
model.tree <- rpart(
  diagnosis ~ ., data = df.train, method = "class")

rpart.plot(model.tree, extra=108)
printcp(model.tree)
## 
## Classification tree:
## rpart(formula = diagnosis ~ ., data = df.train, method = "class")
## 
## Variables actually used in tree construction:
## [1] area_se              concave_points_worst perimeter_worst     
## [4] texture_mean        
## 
## Root node error: 170/456 = 0.37281
## 
## n= 456 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.800000      0  1.000000 1.00000 0.060740
## 2 0.076471      1  0.200000 0.29412 0.039248
## 3 0.017647      2  0.123529 0.18235 0.031619
## 4 0.010000      4  0.088235 0.17059 0.030654
rpart.rules(model.tree, extra=108)
##  diagnosis                                                                                                
##       0.90 when perimeter_worst <  115 & concave_points_worst <  0.16 & area_se >= 33 & texture_mean <  21
##       0.98 when perimeter_worst <  115 & concave_points_worst <  0.16 & area_se <  33                     
##       0.75 when perimeter_worst <  115 & concave_points_worst <  0.16 & area_se >= 33 & texture_mean >= 21
##       0.88 when perimeter_worst <  115 & concave_points_worst >= 0.16                                     
##       0.98 when perimeter_worst >= 115
# Prediction
predicted.classes <- model.tree %>% predict(df.test, type = "class")
matrix.tree <- confusionMatrix(predicted.classes, df.test$diagnosis)
matrix.tree
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 68  7
##          M  3 35
##                                           
##                Accuracy : 0.9115          
##                  95% CI : (0.8433, 0.9567)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : 6.062e-12       
##                                           
##                   Kappa : 0.8068          
##                                           
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.9577          
##             Specificity : 0.8333          
##          Pos Pred Value : 0.9067          
##          Neg Pred Value : 0.9211          
##              Prevalence : 0.6283          
##          Detection Rate : 0.6018          
##    Detection Prevalence : 0.6637          
##       Balanced Accuracy : 0.8955          
##                                           
##        'Positive' Class : B               
## 
#Pruning
model.tree.p <- prune(model.tree, cp=.011765)
rpart.plot(model.tree.p, extra=108)

rpart.rules(model.tree.p, extra=108)
##  diagnosis                                                                                                
##       0.90 when perimeter_worst <  115 & concave_points_worst <  0.16 & area_se >= 33 & texture_mean <  21
##       0.98 when perimeter_worst <  115 & concave_points_worst <  0.16 & area_se <  33                     
##       0.75 when perimeter_worst <  115 & concave_points_worst <  0.16 & area_se >= 33 & texture_mean >= 21
##       0.88 when perimeter_worst <  115 & concave_points_worst >= 0.16                                     
##       0.98 when perimeter_worst >= 115
# Prediction post-pruning
predicted.classes <- model.tree.p %>% predict(df.test, type = "class")
matrix.tree.p <- confusionMatrix(predicted.classes, df.test$diagnosis)
matrix.tree.p
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 68  7
##          M  3 35
##                                           
##                Accuracy : 0.9115          
##                  95% CI : (0.8433, 0.9567)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : 6.062e-12       
##                                           
##                   Kappa : 0.8068          
##                                           
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.9577          
##             Specificity : 0.8333          
##          Pos Pred Value : 0.9067          
##          Neg Pred Value : 0.9211          
##              Prevalence : 0.6283          
##          Detection Rate : 0.6018          
##    Detection Prevalence : 0.6637          
##       Balanced Accuracy : 0.8955          
##                                           
##        'Positive' Class : B               
## 
# PCA
set.seed(1234)
model.tree.pca <- rpart(
  diagnosis ~ ., data = df.train.pca, method = "class")

rpart.plot(model.tree.pca, extra=108)

printcp(model.tree.pca)
## 
## Classification tree:
## rpart(formula = diagnosis ~ ., data = df.train.pca, method = "class")
## 
## Variables actually used in tree construction:
## [1] Dim.1 Dim.2 Dim.3 Dim.5
## 
## Root node error: 170/456 = 0.37281
## 
## n= 456 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.776471      0  1.000000 1.00000 0.060740
## 2 0.041176      1  0.223529 0.30588 0.039926
## 3 0.035294      3  0.141176 0.26471 0.037462
## 4 0.011765      4  0.105882 0.20588 0.033438
## 5 0.010000      5  0.094118 0.20000 0.032996
rpart.rules(model.tree.pca, extra=108)
##  diagnosis                                                                      
##       0.88 when Dim.1 >=         1.2                               & Dim.5 <  -2
##       0.93 when Dim.1 is -1.0 to 1.2 & Dim.2 >= -1.3 & Dim.3 >= -2              
##       0.99 when Dim.1 <  -1.0                                                   
##       0.62 when Dim.1 is -1.0 to 1.2 & Dim.2 >= -1.3 & Dim.3 <  -2              
##       0.85 when Dim.1 is -1.0 to 1.2 & Dim.2 <  -1.3                            
##       0.98 when Dim.1 >=         1.2                               & Dim.5 >= -2
# Prediction PCA
predicted.classes.pca <- model.tree.pca %>% predict(df.test.pca, type = "class")
matrix.tree.pca <- confusionMatrix(predicted.classes.pca, df.test.pca$diagnosis)
matrix.tree.pca
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 68  8
##          M  3 34
##                                           
##                Accuracy : 0.9027          
##                  95% CI : (0.8325, 0.9504)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : 3.429e-11       
##                                           
##                   Kappa : 0.7864          
##                                           
##  Mcnemar's Test P-Value : 0.2278          
##                                           
##             Sensitivity : 0.9577          
##             Specificity : 0.8095          
##          Pos Pred Value : 0.8947          
##          Neg Pred Value : 0.9189          
##              Prevalence : 0.6283          
##          Detection Rate : 0.6018          
##    Detection Prevalence : 0.6726          
##       Balanced Accuracy : 0.8836          
##                                           
##        'Positive' Class : B               
## 

3. Random Forest

# Random Forest Model (All variables) Original Data
set.seed(1234)
model.rf <- train(
  diagnosis ~ ., data = df.train, method = "rf",
  trControl = trainControl("cv", number = 10),
  importance = FALSE
  )
model.rf$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, importance = FALSE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 30
## 
##         OOB estimate of  error rate: 4.39%
## Confusion matrix:
##     B   M class.error
## B 278   8  0.02797203
## M  12 158  0.07058824
# Plot MeanDecreaseGini
varImpPlot(model.rf$finalModel, type = 2)

varImp(model.rf)
## rf variable importance
## 
##   only 20 most important variables shown (out of 30)
## 
##                          Overall
## perimeter_worst         100.0000
## concave_points_worst     86.6696
## area_worst               30.1371
## concave_points_mean      24.6583
## radius_worst             14.9139
## texture_worst             5.9261
## texture_mean              5.4653
## area_se                   4.8550
## concavity_worst           2.8408
## concavity_mean            2.0266
## smoothness_worst          1.5420
## compactness_worst         1.3437
## area_mean                 0.9296
## fractal_dimension_se      0.6642
## symmetry_worst            0.6553
## fractal_dimension_worst   0.6492
## symmetry_mean             0.6414
## radius_se                 0.5587
## concave_points_se         0.4607
## texture_se                0.4460
# Prediction
predicted.classes <- model.rf %>% predict(df.test)
matrix.rf <- confusionMatrix(predicted.classes, df.test$diagnosis)
matrix.rf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 69  2
##          M  2 40
##                                           
##                Accuracy : 0.9646          
##                  95% CI : (0.9118, 0.9903)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9242          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9718          
##             Specificity : 0.9524          
##          Pos Pred Value : 0.9718          
##          Neg Pred Value : 0.9524          
##              Prevalence : 0.6283          
##          Detection Rate : 0.6106          
##    Detection Prevalence : 0.6283          
##       Balanced Accuracy : 0.9621          
##                                           
##        'Positive' Class : B               
## 
# Random Forest Model (Top 5 variables most important) Original Data
set.seed(1234)
model.rf2 <- train(
  diagnosis ~ perimeter_worst + radius_worst + concave_points_worst + area_worst + concave_points_mean, data = df.train, method = "rf",
  trControl = trainControl("cv", number = 10),
  importance = FALSE
  )
model.rf2$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, importance = FALSE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 6.58%
## Confusion matrix:
##     B   M class.error
## B 273  13  0.04545455
## M  17 153  0.10000000
# Plot MeanDecreaseGini
varImpPlot(model.rf2$finalModel, type = 2)

varImp(model.rf2)
## rf variable importance
## 
##                      Overall
## perimeter_worst      100.000
## concave_points_worst  57.028
## area_worst            28.657
## radius_worst           8.771
## concave_points_mean    0.000
# Prediction
predicted.classes <- model.rf2 %>% predict(df.test)
matrix.rf2 <- confusionMatrix(predicted.classes, df.test$diagnosis)
matrix.rf2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 68  4
##          M  3 38
##                                           
##                Accuracy : 0.9381          
##                  95% CI : (0.8765, 0.9747)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : 1.718e-14       
##                                           
##                   Kappa : 0.8667          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9577          
##             Specificity : 0.9048          
##          Pos Pred Value : 0.9444          
##          Neg Pred Value : 0.9268          
##              Prevalence : 0.6283          
##          Detection Rate : 0.6018          
##    Detection Prevalence : 0.6372          
##       Balanced Accuracy : 0.9313          
##                                           
##        'Positive' Class : B               
## 
# PCA
set.seed(1234)
model.rf.pca <- train(
  diagnosis ~ ., data = df.train.pca, method = "rf",
  trControl = trainControl("cv", number = 10),
  importance = FALSE
  )
model.rf.pca$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, importance = FALSE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 4.39%
## Confusion matrix:
##     B   M class.error
## B 278   8  0.02797203
## M  12 158  0.07058824
# Plot MeanDecreaseGini
varImpPlot(model.rf.pca$finalModel, type = 2)

varImp(model.rf.pca)
## rf variable importance
## 
##        Overall
## Dim.1 100.0000
## Dim.2  14.4252
## Dim.3  10.1429
## Dim.5   2.3204
## Dim.4   0.3595
## Dim.6   0.0000
# Prediction PCA
predicted.classes.pca <- model.rf.pca %>% predict(df.test.pca)
matrix.rf.pca <- confusionMatrix(predicted.classes.pca, df.test.pca$diagnosis)
matrix.rf.pca
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 69  3
##          M  2 39
##                                           
##                Accuracy : 0.9558          
##                  95% CI : (0.8998, 0.9855)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9048          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9718          
##             Specificity : 0.9286          
##          Pos Pred Value : 0.9583          
##          Neg Pred Value : 0.9512          
##              Prevalence : 0.6283          
##          Detection Rate : 0.6106          
##    Detection Prevalence : 0.6372          
##       Balanced Accuracy : 0.9502          
##                                           
##        'Positive' Class : B               
## 

4. Logistic Regression

model.ml.pca <- train(diagnosis ~., data = df.train.pca, method = "glm")
summary(model.ml.pca)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6870  -0.0488  -0.0042   0.0006   3.5749  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.6840     0.3626  -1.886 0.059244 .  
## Dim.1         2.8572     0.4982   5.735 9.78e-09 ***
## Dim.2        -1.8064     0.3619  -4.991 6.00e-07 ***
## Dim.3        -0.8409     0.3055  -2.752 0.005915 ** 
## Dim.4         0.7382     0.2432   3.036 0.002400 ** 
## Dim.5         1.7634     0.5349   3.297 0.000978 ***
## Dim.6         0.4458     0.3335   1.337 0.181296    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 602.315  on 455  degrees of freedom
## Residual deviance:  63.232  on 449  degrees of freedom
## AIC: 77.232
## 
## Number of Fisher Scoring iterations: 10
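The coefficients in the summary above are on the log-odds scale, so exponentiating an estimate turns it into an odds ratio. A quick sketch using the printed Dim.1 estimate (assuming, as caret does by default, that the second factor level "M" is the modeled class):

```r
# The glm coefficients are log-odds, so exp() converts them to odds
# ratios: a one-unit increase in Dim.1 multiplies the odds of a
# malignant diagnosis by roughly exp(2.8572).
exp(2.8572)
# ~ 17.4
```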
predicted.classes.pca <- model.ml.pca %>% predict(df.test.pca)
matrix.ml.pca <- confusionMatrix(predicted.classes.pca, df.test.pca$diagnosis)
matrix.ml.pca
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 70  3
##          M  1 39
##                                           
##                Accuracy : 0.9646          
##                  95% CI : (0.9118, 0.9903)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9235          
##                                           
##  Mcnemar's Test P-Value : 0.6171          
##                                           
##             Sensitivity : 0.9859          
##             Specificity : 0.9286          
##          Pos Pred Value : 0.9589          
##          Neg Pred Value : 0.9750          
##              Prevalence : 0.6283          
##          Detection Rate : 0.6195          
##    Detection Prevalence : 0.6460          
##       Balanced Accuracy : 0.9572          
##                                           
##        'Positive' Class : B               
## 

5. SVM

# Support Vector Machine Original Data
set.seed(1234)
model.svm <- train(
  diagnosis ~., data = df.train, method = "svmLinear",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale")
  )

predicted.classes <- model.svm %>% predict(df.test)
matrix.svm <- confusionMatrix(predicted.classes, df.test$diagnosis)
matrix.svm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 71  4
##          M  0 38
##                                           
##                Accuracy : 0.9646          
##                  95% CI : (0.9118, 0.9903)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9227          
##                                           
##  Mcnemar's Test P-Value : 0.1336          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9048          
##          Pos Pred Value : 0.9467          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6283          
##          Detection Rate : 0.6283          
##    Detection Prevalence : 0.6637          
##       Balanced Accuracy : 0.9524          
##                                           
##        'Positive' Class : B               
## 
# PCA
set.seed(1234)
model.svm.pca <- train(
  diagnosis ~., data = df.train.pca, method = "svmLinear",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale")
  )

predicted.classes.pca <- model.svm.pca %>% predict(df.test.pca)
matrix.svm.pca <- confusionMatrix(predicted.classes.pca, df.test.pca$diagnosis)
matrix.svm.pca
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 71  3
##          M  0 39
##                                           
##                Accuracy : 0.9735          
##                  95% CI : (0.9244, 0.9945)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9423          
##                                           
##  Mcnemar's Test P-Value : 0.2482          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9286          
##          Pos Pred Value : 0.9595          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6283          
##          Detection Rate : 0.6283          
##    Detection Prevalence : 0.6549          
##       Balanced Accuracy : 0.9643          
##                                           
##        'Positive' Class : B               
## 


6. Conclusions

# Build a comparison table of the main test-set metrics for every model.
# Helper: extract Accuracy, Sensitivity, Specificity and the positive /
# negative predictive values from a confusionMatrix object as a
# one-column data frame named after the model.
extract.metrics <- function(cm, model.name) {
  metrics <- as.data.frame(c(cm$overall["Accuracy"], cm$byClass[1:4]))
  colnames(metrics) <- model.name
  metrics
}

final <- as.data.frame(t(cbind(
  extract.metrics(matrix.knn,      "KNN"),
  extract.metrics(matrix.knn.pca,  "KNN PCA"),
  extract.metrics(matrix.tree,     "CART"),
  extract.metrics(matrix.tree.p,   "CART Pruned"),
  extract.metrics(matrix.tree.pca, "CART PCA"),
  extract.metrics(matrix.rf,       "RF"),
  extract.metrics(matrix.rf2,      "RF (TOP 5)"),
  extract.metrics(matrix.rf.pca,   "RF PCA"),
  extract.metrics(matrix.ml.pca,   "Logit PCA"),
  extract.metrics(matrix.svm,      "SVM"),
  extract.metrics(matrix.svm.pca,  "SVM PCA")
)))

as.datatable(formattable(final, list(
            Accuracy = color_tile("#e6ad9c","#df5227"),
            Sensitivity = color_tile("#b5d7eb","#56B4E9"),
            Specificity = color_tile("#b5d7eb","#56B4E9"),
            `Pos Pred Value` = color_tile("#f2dfa2","#fec306"),
            `Neg Pred Value` = color_tile("#f2dfa2","#fec306")
            )), options = list(pageLength =11, dom = 'tip'))

The Support Vector Machine model with PCA pre-processing performed best, with an accuracy of 97.3%.

The Random Forest, Logistic Regression with PCA pre-processing, KNN, and SVM models also achieved very good accuracy, about 96.5%.

In general, the models achieved accuracies ranging from 90.2% to 97.3%, with a good balance between sensitivity and specificity, i.e. similar levels of performance on both classes.
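With only 113 test observations, these accuracies carry fairly wide confidence intervals. As a sketch, the 95% CI that confusionMatrix() prints is the exact binomial interval, which can be reproduced directly from the confusion-matrix counts (here for the SVM with PCA model):

```r
# Reproduce caret's 95% accuracy CI for the SVM + PCA model from its
# confusion matrix: 71 + 39 correct predictions out of 113 test cases.
# confusionMatrix() reports the exact (Clopper-Pearson) interval
# from binom.test().
correct <- 71 + 39
total   <- 71 + 0 + 3 + 39
round(binom.test(correct, total)$conf.int, 4)
# approximately (0.9244, 0.9945), as printed above
```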

Finally, taking into account the analysis conducted, we would like to point out that the following cell nuclei characteristics seem to be the most relevant for diagnosing breast cancer through the fine needle aspiration (FNA) procedure: perimeter_worst, concave_points_worst, area_worst, concave_points_mean, and radius_worst.

We hope you enjoyed this kernel.

If you have any questions or suggestions about this project, we would appreciate receiving them.